By Vivek Katial (Executive Director of GDI & Data Scientist at Multitudes)
This piece was originally published on the blog of Multitudes, a company working to unlock happier, higher-performing teams, and has been adapted for an NFP context by weaving in learnings from the Good Data Institute. The article is cross-posted with permission, and the original piece can be found here.
From controlling access to information to deciding how resources are allocated, algorithms influence nearly every part of our lives today. Yet despite technology’s ability to help humans, it can also perpetuate and exacerbate systemic oppression and bias.
At the Good Data Institute, our volunteer team works with not-for-profit data that can be sensitive, so it’s important that we constantly think about data ethics. In this blog post, we discuss the steps involved in developing a machine learning product, how biases can be introduced during this process (algorithmic bias), ways to mitigate or eliminate these biases, and how we’re putting these principles into practice at the Good Data Institute.
Harmful impacts of algorithmic bias gaining headlines in the press
What does algorithmic bias actually mean?
Algorithmic bias refers to the tendency of algorithms to systematically and repeatedly produce outcomes that favour one group over another. For example, consider the task of image classification using the following two images:
Both are typical wedding photographs, one from a Western wedding and another from an Indian wedding. However, when neural networks were trained on ImageNet — one of the world’s most widely-used open-source datasets that contains more than 14 million images — they produced very different predictions for these two images (read the paper here).
Predictions on the image of the Western bride included labels such as “bride”, “wedding”, “ceremony”. In contrast, for the woman wearing a traditional Indian wedding dress, the predicted labels were “costume”, “performing arts”, “event”. Though this example may be considered trivial by some, such errors are not uncommon, especially given the lack of diversity in the datasets that data scientists typically use for the purposes of model building. So it is not a surprise that there are already many examples in society where algorithms have harmed marginalized groups (see here, and here).
The Machine Learning Lifecycle
The machine learning (ML) lifecycle can be broken up into the following five steps:
Data collection and preparation
Feature engineering and selection
Model evaluation
Model interpretation and explainability
Model deployment
Biases can arise within each step in the ML lifecycle, so we need steps to mitigate them.
1. Data Collection and Preparation
What the step means
This is the first step in the ML process, where one collects, labels and prepares data for modelling and analysis purposes.
Data can come from a variety of sources, including real-world usage data, survey data, public data sets, and simulated data. Choosing the source for your data will depend on availability and what makes sense for your specific project.
Example of how bias can arise
Issues arise when the data collected doesn’t fully reflect the real world. For example, studies of such biases have shown that the most popular image datasets – ImageNet, COCO and OpenImages – contain images sourced mostly from Europe and North America, even though the majority of the world’s population lives in Asia. As a result, models trained on these datasets perform worse for people from continents such as Asia and Africa.
Note that there are many more ways that bias can leak into the data collection process; this is just one example.
Density maps showing the geographical distribution of images in popular image datasets. A world population density map is shown for reference (bottom-right).
Example of an action to mitigate this
Ensure that the data is collected in a manner that reflects reality. In fact, because historical data is collected from a society in which systems of oppression operate, you might even want to over/under-sample data from marginalized groups in order to move towards more equitable datasets. One example of when you might want to do this would be for a facial recognition app: Since there are fewer images of BIPOC folks, you might want to oversample images of them. A fantastic resource to learn more about equitable data collection practices is Timnit Gebru’s article “Datasheets for Datasets” which proposes a framework for transparent and accountable data collection.
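As a concrete illustration, here is a minimal sketch – using pandas and a hypothetical `images` metadata table with a `region` column – of oversampling under-represented groups towards parity. It is a sketch only; in practice you would prioritise collecting more real data from under-represented groups rather than simply duplicating rows.

```python
import pandas as pd

# Hypothetical dataset metadata: one row per image, with a 'region' column
# recording where the image was sourced from.
images = pd.DataFrame({
    "image_id": range(10),
    "region": ["europe"] * 6 + ["north_america"] * 3 + ["asia"] * 1,
})

# Oversample each region (with replacement) up to the size of the largest group,
# so every group contributes equally to the training set.
target_size = images["region"].value_counts().max()
balanced = pd.concat(
    [
        group.sample(n=target_size, replace=True, random_state=42)
        for _, group in images.groupby("region")
    ],
    ignore_index=True,
)

print(balanced["region"].value_counts())
```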
Moreover, to help ensure optimal results, it’s essential that organisations have diverse tech teams in charge of both building models and creating training data. If training data is collected or processed by external partners, it is important to recruit diverse annotator pools so the data can be more representative.
2. Feature Engineering and Selection
What the step means
When building models for prediction, we construct features. A feature is simply a characteristic of each data point that might help with the prediction. For example, if we are predicting the expected average donation amount per person, useful features might be the age of the individual or the postcode they live in.
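As a minimal sketch (assuming a hypothetical, anonymised `donations` table), deriving these kinds of features might look like this:

```python
import pandas as pd

# Hypothetical, anonymised donation records: one row per donation.
donations = pd.DataFrame({
    "donor_id": [1, 1, 2, 3, 3, 3],
    "birth_year": [1985, 1985, 1962, 1990, 1990, 1990],
    "postcode": ["3000", "3000", "2010", "6000", "6000", "6000"],
    "amount": [50.0, 30.0, 120.0, 20.0, 25.0, 15.0],
})

# One row per donor: the target (average donation) plus candidate features.
per_donor = donations.groupby("donor_id").agg(
    avg_donation=("amount", "mean"),        # target we want to predict
    birth_year=("birth_year", "first"),
    postcode=("postcode", "first"),
)
per_donor["age"] = 2024 - per_donor["birth_year"]   # feature: age of the individual
# 'postcode' stays as a categorical feature: where the individual lives.
print(per_donor[["age", "postcode", "avg_donation"]])
```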
Example of how bias can arise
Consider a model that predicts whether a police officer should be deployed in a particular suburb based on past incarceration data: A data scientist may claim to have built a model which is “socially neutral” because they have removed all features that correspond to race, age, and gender. However, other features like postcode can still correlate with race (because in the real world, suburbs are segregated along racial lines). In fact, this study demonstrates the potential for predictive policing to propagate and exacerbate racial biases in law enforcement.
Example of an action to mitigate this
The simplest counter-measure is to critically examine the relationships between features that may correlate. In the example above, even after removing features about race, age and gender from the modelling process, one should still look for other features (such as postcode) that correlate with those demographic attributes; a sketch of one such check is shown below. It’s also important to initiate and maintain contact with communities and stakeholders from different marginalised groups and to take a participatory approach to ML. This paper introduces Community Based System Dynamics (CBSD) as a way to engage different groups in designing fairer ML systems. So when designing and deciding on features for models, it’s important to engage with the community who would be most impacted by the model and get their feedback. Even then, it is not clear that this is sufficient to eliminate all biases.
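The check below is one lightweight possibility, sketched with hypothetical column names: measure the association between a candidate feature (postcode) and a protected attribute (race) using Cramér’s V on a contingency table. A high value warns that the feature is acting as a proxy.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association between two categorical variables (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))

# Hypothetical anonymised data with a candidate feature and a protected attribute.
df = pd.DataFrame({
    "postcode": ["3000", "3000", "2010", "2010", "6000", "6000"],
    "race":     ["a",    "a",    "b",    "b",    "c",    "c"],
})

association = cramers_v(df["postcode"], df["race"])
print(f"Cramér's V between postcode and race: {association:.2f}")
# A value close to 1 means postcode is effectively standing in for race.
```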
3. Model Evaluation
What the step means
This involves assessing how accurately a model predicts a certain outcome – for example, correctly identifying a person’s face in a facial recognition system.
Example of how bias can arise
In a recent study, the authors discuss intersectional model analysis as a tool to assess model accuracy, inspired by the sociological framework of intersectionality.
“Intersectionality means that people can be subject to multiple, overlapping forms of oppression, which interact and intersect with each other.” - Kimberlé Crenshaw
In the “Gender Shades Project”, researchers used this approach to examine companies that were selling facial recognition technologies boasting accuracies of up to 90%. However, when the accuracy was broken down by different intersectional sub-groups, it was found that the error rates for darker-skinned women were as high as 34.7%, whereas for lighter-skinned males the error rate was only 0.8%. In hindsight, this is hardly something a multi-trillion-dollar business should be selling at scale, let alone promoting as “accurate”.
Example of an action to mitigate this
It’s necessary for data scientists to advocate for measures of model performance that are broken down by intersectional subgroups. This is another reason why having a representative dataset matters – so there’s enough data to evaluate the model’s accuracy for different demographic groups. The model cards approach discussed here is a great resource for reporting model performance.
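In practice, this can be as simple as reporting accuracy per subgroup instead of a single aggregate number. The sketch below assumes a hypothetical evaluation table with predictions, true labels, and demographic attributes:

```python
import pandas as pd

# Hypothetical held-out evaluation set with model predictions and demographics.
results = pd.DataFrame({
    "y_true":    [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred":    [1, 0, 0, 1, 0, 0, 1, 1],
    "gender":    ["f", "f", "f", "f", "m", "m", "m", "m"],
    "skin_tone": ["darker", "darker", "lighter", "lighter",
                  "darker", "darker", "lighter", "lighter"],
})

# A single aggregate accuracy hides how performance varies across subgroups.
overall = (results["y_true"] == results["y_pred"]).mean()
by_subgroup = (
    results.assign(correct=results["y_true"] == results["y_pred"])
    .groupby(["gender", "skin_tone"])["correct"]
    .mean()
)

print(f"Overall accuracy: {overall:.2f}")
print(by_subgroup)  # accuracy for each (gender, skin_tone) combination
```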
It is also important to gather a diverse ML team that asks diverse questions. We all bring different experiences and ideas to the workplace. People from diverse backgrounds – be that race, gender, age, experience, culture, etc. – will inherently ask different questions and interact with your model in contrasting ways. That can help you catch problems before your model is in production. At the Good Data Institute, our volunteer community of more than 100 D&A professionals spans a wide range of ages, genders, races and cultures.
4. Model Interpretation and Explainability
What the step means
Model explainability is a concept which looks into the ability to understand the results of a machine learning model. The extent to which a model's results are explainable to stakeholders should be a key consideration when evaluating different models, especially in human-centric applications.
Example of how bias can arise
Many examples exist of individuals being unfairly impacted by the output of a model. In 2007, a teacher was fired from a Washington DC school because of an algorithm: Despite highly favourable reviews from students and parents, an opaque algorithm rated her performance as being in the bottom 2% of all teachers.
Example of an action to mitigate this
When humans interact with ML systems, it is imperative that they understand exactly how and what personal data will be used, and why a model is being used in the first place.
Product people, software developers, and designers should have a high-level understanding of the ML system they are building, so they can probe what data is being used, and all the ways that the model's predictions might impact an end user's decisions in the real world.
For data scientists, there are many tools available to help understand model behaviour, such as SHAP (SHapley Additive exPlanations), which quantifies the effect of each feature on a model’s predictions. When working with deep learning models – which learn abstract representations of the data that humans cannot easily inspect – data scientists can also use tools such as LIME, which is designed to work with any black-box algorithm.
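As a rough, self-contained sketch (using a toy scikit-learn model rather than anything from a real project), SHAP can be used to see which features drive a model’s predictions:

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy training data: predicting average donation amount from two features.
X = pd.DataFrame({
    "age": [25, 34, 45, 52, 61, 29, 48, 70],
    "years_as_supporter": [1, 3, 10, 7, 20, 2, 5, 15],
})
y = [20, 35, 80, 60, 150, 25, 55, 120]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values for tree-based models: each value is a
# feature's contribution to an individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summarise which features the model relies on most, and in which direction.
shap.summary_plot(shap_values, X)
```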
5. Model Deployment
What the step means
Once we’ve trained a model, evaluated that it works effectively, and completed the R&D process, the model is deployed into production.
Example of how bias can arise
At this step, it is important to ensure that the model is being used for its intended purpose. Bias is often introduced because there are inconsistencies between the problem a model was built to solve and the way it is used in the real world. This is especially the case when it is developed and evaluated in a totally self-contained environment, when in reality it exists as part of a complex social system with many decision-makers. For example, Microsoft’s NLP bot learnt racial slurs within 24 hours of being exposed to Twitter. Another issue is that production data drifts over time – a phenomenon known as data or concept drift – which degrades model performance.
Example of an action to mitigate this
It is necessary to consistently track the quality of the input data. Without robust monitoring in place, the distribution of the input data can drift towards being more biased, even if the model creators ensured diversity in the initial dataset. This makes the model less performant for certain demographics, which means that the earlier work to manage ethical considerations appropriately can be undone. We can track this by comparing the distribution of new production data with the training data used in model development. It’s also important to label, version, and date the models being used in production so it is easy to roll back, or even switch off, models that are performing poorly.
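One simple version of such a check, sketched below under the assumption of a numeric feature, is a two-sample Kolmogorov–Smirnov test comparing each feature’s production distribution with its training distribution:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values: what the model was trained on vs. what it now sees.
training_age = rng.normal(loc=40, scale=10, size=1_000)
production_age = rng.normal(loc=48, scale=10, size=1_000)  # the population has shifted

# Two-sample KS test: a small p-value suggests the two distributions differ.
statistic, p_value = ks_2samp(training_age, production_age)
if p_value < 0.01:
    print(f"Drift detected in 'age' (KS statistic = {statistic:.3f}) – investigate, and consider retraining.")
else:
    print("No significant drift detected in 'age'.")
```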
How we’re putting this into practice at The Good Data Institute
For us at The Good Data Institute, one of our data principles is that if we work with a not-for-profit organisation’s data, we make sure that they get value and impact from our collaboration, and that we use the data transparently and anonymously. In addition, we never work with individually identifiable data – such as donor-level donation records – and all individual data is always anonymised.
On occasion, our volunteer teams will build an ML product for a not-for-profit organisation. This is not our typical project, but it may become more common over time as the data maturity of NFP organisations continues to climb. In these instances:
Data Collection and Preparation:
If data labelling is required, we divide it up across a diverse group of people. This is important to minimise the bias that our model learns from the labelled data. This is where the diversity of our 100+ volunteer community is beneficial, and D&I is a key pillar in our recruitment metrics.
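Where it helps, we can also check how consistently different labellers apply the labelling guidelines. The sketch below is a hypothetical example using Cohen’s kappa from scikit-learn on two annotators’ labels for the same items:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: two annotators from different backgrounds label the same items.
labels_annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
labels_annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Cohen's kappa measures agreement beyond chance (1 = perfect, 0 = chance level).
kappa = cohen_kappa_score(labels_annotator_a, labels_annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")

# Items where annotators disagree can surface ambiguous or culturally loaded labels
# that deserve discussion before they are baked into the training data.
disagreements = [
    i for i, (a, b) in enumerate(zip(labels_annotator_a, labels_annotator_b)) if a != b
]
print(f"Items to review: {disagreements}")
```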
Feature engineering & selection
We attempt to critically examine the relationships between features that may correlate. We also consider how we can initiate and maintain contact with communities and stakeholders from different marginalised groups and take a participatory approach to ML.
Model evaluation
We often consider doing intersectional evaluation of all of our models on our own team (since we have our own demographic data and a diverse team).
When surveying and analysing our own performance as an organisation, we always look at how volunteers from diverse backgrounds reported on our performance.
Model deployment
We run bad-actor exercises in which we identify nefarious ways that someone could misuse data from our tool. This has helped us navigate difficult product considerations.

At the end of the day, humans are the ones who create algorithms, so we also recognise the importance of the broader culture and environment we create at The Good Data Institute. Some things we consider are:
Thinking about the people we have in the room. This means reducing bias in recruitment and doing proactive outreach to under-represented populations when recruiting new volunteers. It also means consciously creating an environment that listens to those different voices, e.g., considering the share of voice in our team meetings, and making sure to rotate “office housework” tasks, which people from marginalised groups are often expected to do without the appropriate compensation.
Committing to doing ongoing learning about oppression and privilege. For example, our leadership team and GDI Fellows (Project Leaders) will be involved in learning about a social cause this year via facilitated training sessions.
Conclusions
This article has been a broad overview of some of the ethical pitfalls of machine learning systems. Our hope is to provide points to consider when working with ML systems, as well as an example of how we’re putting these mitigations into practice so far at The Good Data Institute.
However, the subject of “Equity and Accountability in AI” is a vast and well-studied field and we’ve hardly scratched the surface. We hope this encourages everyone from AI researchers to end-users and the general public to have sustained dialogue on the importance of ethical considerations when building and interacting with ML systems.
Moreover, it’s worth noting that reducing algorithmic bias is not the full answer – the bigger, more important task is to dismantle systemic oppression. As individuals and as a collective, we can take action to create a more equitable world by making choices in what we consume, how we live, how we work, who we vote for and where we volunteer our time.
We are thankful for the D&A professionals who are choosing to vote with their feet by joining our community at the Good Data Institute.
Acknowledgements
GDI does not and shall not discriminate on the basis of race, color, religion (creed), gender, gender expression, age, national origin (ancestry), disability, marital status, sexual orientation, or military status, in any of its activities or operations and supports all people as equals.
GDI acknowledges the traditional custodians of Country throughout Australia and recognises their unique cultural and spiritual relationships to the land, waters and seas and their rich contribution to society. We pay our respects to ancestors and Elders, past, present, and emerging.
GDI acknowledges and respects ngā iwi Māori as the Tangata Whenua of Aotearoa and is committed to upholding the principles of the Treaty of Waitangi.
Resources to learn more
This article barely touched the surface of this vast topic. Here are a bunch of resources that you can use to find out more about equity and accountability in data science - and we’ll keep sharing our learnings and approach as we go!
Examples of Unethical AI systems in society
Predictive Policing resulting in the shooting of Robert McDaniel (Chicago)
Individual wrongfully accused of a crime he didn’t commit (Michigan)
Federal Study confirming racial bias in Facial Recognition Systems
Google Facial Recognition Software Predicts Black Woman as a Gorilla
Teacher fired from a Washington DC school based on the output of an algorithm
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
Research Groups and Organisations
Toolkits, Code and Other Fun Stuff
Books
Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (Virginia Eubanks)
Weapons of Math Destruction (Cathy O’Neil)
Algorithms of Oppression (Safiya Umoja Noble)
Data Feminism (Catherine D'Ignazio and Lauren Klein)