Nothing has transformed organizations and businesses in the 21st century as profoundly as the growing accessibility of data. In 2022, the global big data analytics market was worth almost $272 billion, with a forecast CAGR of 13.5% over the coming years. To put that number into context, it is roughly four times the annual GDP of Lithuania, a small EU country.
However, as with every new technology, people quickly started raising questions about undesirable or simply unexpected negative consequences of big data. Most of the attention has been given to security and privacy issues, mass state surveillance, and huge energy consumption. Yet, there wasn’t a lot of talk about data ethics and such issues as data-driven social bias – at least, not until the recent deployment of publicly available artificial intelligence (AI) solutions.
Today, AI and machine learning (ML) technologies are the main ‘consumers’ of big data. Recurrent cases of biased AI outputs have provoked a discussion of whether big-data-driven systems, and the data itself, can be unfair and, if so, how we could improve them.
The challenge of unconscious bias
The most basic explanation of fair data rests on the concept of bias: fairness in data means using it in a way that doesn’t reinforce social or cultural bias. Bias can appear at different stages of the data cycle – the data used as input for analysis or ML development can carry built-in biases, or bias can be introduced by humans when analyzing information or developing algorithms.
One of the most common biases is sampling or representation bias, which appears when the population or group represented in a dataset is poorly defined or sampled. This sort of bias is well known to any student doing quantitative or qualitative analysis. Although bad sampling can spoil an entire study, it is easy to detect if one looks carefully at the data at hand.
However, things get more complicated in the case of big data, which is mostly noisy and comes in huge volumes. Finding and clearing implicit prejudices and small discrepancies from large datasets is challenging. An infamous example of sampling bias surfaced a few years ago, when researchers found that Amazon’s prominent face recognition tool had difficulties recognizing female and dark-skinned faces because the ML algorithm had been predominantly trained on pictures of white men.
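For illustration, here is a minimal sketch of the kind of pre-training sanity check that can surface representation bias before a model is ever trained. The file name, column names, and 10% threshold are hypothetical, not taken from any real pipeline:

```python
# A minimal sketch of a pre-training sanity check: compare how groups are
# represented in a dataset (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("training_faces.csv")  # hypothetical metadata file

# Share of each demographic group in the training set
group_shares = df["skin_tone"].value_counts(normalize=True)
gender_shares = df["gender"].value_counts(normalize=True)

print(group_shares)
print(gender_shares)

# Flag groups that fall below a chosen representation threshold (e.g. 10%)
THRESHOLD = 0.10
underrepresented = group_shares[group_shares < THRESHOLD]
if not underrepresented.empty:
    print("Warning: underrepresented groups:", list(underrepresented.index))
```

A check this simple obviously won’t catch subtler imbalances, but it shows how little code is needed to at least look at the distribution before training.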
Bias can also occur during the data analysis or ML model development stage, for example, when labeling data or supervising algorithmic outputs. Even if the dataset is well sampled and cleaned, a human supervisor might still bring in her own stereotypes and reject the outputs of a correct algorithm. Unfortunately, such bias is often unconscious, based on deeply ingrained social and cultural schemas, beliefs, and habits, and thus difficult to eliminate.
Is developing bias-free AI even possible?
Back in 2016, the White House issued a report emphasizing data fairness and stating that discrimination may be an unintentional outcome of how big data technologies are structured and used. Seven years later, unfortunately, cases of biased decisions made by data-driven technologies are more prominent than ever. It is possible to blame rapid AI developments and a lack of regulation; a less comfortable approach would be to ask whether AI, and the data it is built upon, can ever be truly objective and fair.
An ML model is just an algorithm, an equation: it takes the data it is given and produces an output through a human-made system. As such, AI is a predictive system that reveals common statistical patterns in the training data. If gender (or any other feature) is disproportionately represented in the data fed to the algorithm, the model can infer that certain patterns correlate with gender and use them indirectly to make a final decision. This way, human stereotypes become ingrained in software through hidden correlations.
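To make the mechanism concrete, here is a small, purely illustrative sketch: the synthetic data encodes a historically biased hiring outcome, gender is never shown to the model, yet a correlated “proxy” feature is enough for the model to reproduce the bias. All names and numbers are invented for the example:

```python
# Sketch: a proxy feature correlated with gender lets a model reproduce a
# biased historical outcome even though gender itself is never given to it.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

gender = rng.integers(0, 2, n)             # 0 or 1, hidden from the model
proxy = gender + rng.normal(0, 0.3, n)     # some feature correlated with gender
skill = rng.normal(0, 1, n)                # genuinely relevant feature

# Historical hiring decisions were biased in favor of gender == 0
hired = ((skill + 1.5 * (gender == 0) + rng.normal(0, 0.5, n)) > 1).astype(int)

model = LogisticRegression().fit(np.column_stack([skill, proxy]), hired)
pred = model.predict(np.column_stack([skill, proxy]))

# The model "rediscovers" the bias through the proxy
print("Predicted hire rate, gender 0:", pred[gender == 0].mean())
print("Predicted hire rate, gender 1:", pred[gender == 1].mean())
```

Simply dropping the sensitive attribute, in other words, doesn’t make the prediction blind to it.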
Statistical and computational methods for recognizing and eliminating bias in various stages of ML and AI development do exist. However, they increase computational expenses and complicate the engineering process – a price not every company will be willing to pay, especially considering the pressure coming from the competitive race. Furthermore, these methods can be successfully applied in some instances but not others.
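As one example of such a method, a very simple detection metric – the demographic parity difference, i.e. the gap in positive-outcome rates between two groups – might look like the sketch below. The data is a toy example; real toolkits offer many more metrics:

```python
# Sketch of a simple bias check: demographic parity difference between groups.
# Assumes model predictions and a sensitive attribute are already available.
import numpy as np

def demographic_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Difference in positive-outcome rates between the two groups (0 and 1)."""
    rate_0 = y_pred[group == 0].mean()
    rate_1 = y_pred[group == 1].mean()
    return abs(rate_0 - rate_1)

# Toy example: predictions for 8 people, split into two groups
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

gap = demographic_parity_difference(y_pred, group)
print(f"Demographic parity difference: {gap:.2f}")  # 0.75 vs 0.25 -> 0.50
```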
Consider the popular belief that one can build better predictive algorithms with more data, and hence that, if objectivity is lacking, more data will solve the issue. Unfortunately, there is almost always proportionately less data available about minorities. The same can be said about social, economic, or environmental phenomena that are new and lack historical data records. One way out of this trap is generating synthetic data, but basing automated technologies that affect people’s lives on artificially created information is risky, as the sketch below suggests.
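The crudest form of this idea is simply resampling the underrepresented group until the dataset looks balanced. The sketch below is an illustration, not a recommendation: real synthetic-data generators are far more sophisticated, and duplicated rows add no genuinely new information – which is precisely the risk mentioned above.

```python
# Naive sketch of rebalancing by resampling the underrepresented group.
# Duplicated rows make the counts look balanced but add no new information.
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,   # group B is underrepresented
    "feature": range(100),
})

majority = df[df["group"] == "A"]
minority = df[df["group"] == "B"]

# Resample the minority group (with replacement) up to the majority size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

print(balanced["group"].value_counts())
```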
As long as data is produced and analyzed, and algorithms are built, by humans, bias will persist, at least to some extent. According to scientists, bias and prejudice are a natural part of how our brains function and are therefore impossible to eliminate entirely. Nevertheless, we can still make our data-driven technologies fairer and more inclusive.
Rethinking the hard truths
The first step toward this goal is to rethink the basic assumptions we have about data – for example, that more data brings more objectivity and accuracy. Theoretically, this should be true. However, large amounts of data might also increase the space for bias, as it is more complicated to notice and clear implicit prejudice from a large dataset.
Furthermore, data about minorities or marginalized groups is often difficult to access or simply doesn’t exist in the same quantities as data about dominant groups. This might have been the case with the previously mentioned Amazon face recognition tool – the sources the data was fetched from probably held thousands or millions of pictures, but they were not equally distributed across race or gender.
It is worth noting that, in some cases, minorities may be overrepresented. For example, there has been a lot of discussion around ML algorithms used for predictive policing that, apparently, had been trained on historical data reflecting biased policing in Black and Latino neighborhoods. Such biases in data are not easy to eliminate, or even notice, not least because we tend to treat data-driven decisions and historical data records as objective.
In the case of AI, this assumption is particularly misleading. ML algorithms – the basis of almost any AI system – are designed to locate statistical patterns in large datasets, not to reveal some sort of pure, objective truth. Take ChatGPT as an example: it relies on language modeling and probability calculations to determine the next word. Mathematics makes ChatGPT sound very compelling, even when it is wrong.
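To illustrate the principle at a toy scale, here is a tiny bigram “language model” that always picks the statistically most likely next word from its training text. ChatGPT’s internals are vastly more complex, but the underlying logic – continuing patterns rather than verifying truth – is the same:

```python
# Toy bigram language model: the "most probable next word" principle in miniature.
# The output reflects patterns in the training text, not verified truth.
from collections import Counter, defaultdict

training_text = "the model predicts the next word the model repeats the pattern"
words = training_text.split()

# Count which word follows which
next_counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    next_counts[current][nxt] += 1

def most_probable_next(word: str) -> str:
    counts = next_counts[word]
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(most_probable_next("the"))    # 'model' -- the most frequent continuation
print(most_probable_next("next"))   # 'word'
```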
Last, it is worth remembering the old rule that correlation doesn’t always imply causation. The fact that there are significantly fewer women in top IT positions doesn’t imply that women are less capable of holding them; it simply reflects dominant historical patterns in the labor market, shaped by biased education and a lack of flexibility in the workplace. This is self-explanatory to many humans, but the bias still persists in big-data-fed ML models, as the example of Amazon’s scrapped recruiting tool has shown.
Big data offers tremendous opportunities for better decision-making. Today, with the help of ML and AI, we can scan massive amounts of information to find patterns and anomalies at a previously unthinkable speed and scale. On the other hand, big-data-powered ML models are also used to make automated decisions that can significantly affect people’s lives, from credit scores to employment opportunities and even criminal justice.
As organizations delve deeper into the landscape of big data, it is crucial that they understand the negative consequences that can follow when such data is used carelessly. Industries should utilize big data for the best results but, at the same time, take responsibility for using it fairly.
Before getting down to any serious data collection and analysis endeavor, it is crucial to ask what the goal of collecting specific data is, how we are going to manage it, whom the data represents, and whom (or what) it is going to affect. In the context of fairness, it is also worth asking whether the data fully covers the problem area and its context.
The UK government already has a public description of a role that might look quite mythical to some – the data ethicist. However, as I have tried to argue in this article, clearing data of implicit biases and prejudices isn’t an easy task, and it may well happen that, in the future, we will meet data ethicists in the corridors of most data-driven businesses and public organizations.