Big Bad Data

January 11, 2017

Statistician Frederick Hoffman, known for identifying health risks like asbestos and tobacco, is also remembered for his flawed 1896 study claiming African Americans were inherently sicker than whites. This study, influenced by prejudice, had lasting negative impacts. As we increasingly rely on data and predictive algorithms, it’s crucial to avoid such biases and ensure fair, accurate interpretations to prevent perpetuating discrimination and injustice.

This article was first published in The Mint.

Frederick Hoffman was a statistician employed by the Prudential Life Insurance company and tasked with uncovering risk patterns in medical data. So good was he at his job that, even though he had no medical training, he managed to uncover the harmful side-effects of asbestos, identify silicosis as a real disease that was causing fatalities among American workers, and establish the causal relationship between smoking tobacco and lung cancer. By the time he died in 1946, he had 28 books and nearly 1,200 published articles to his credit.

Despite his prolific output in the later part of his life, he is best remembered for his first book, Race Traits and Tendencies of the American Negro. Originally published in 1896, it is arguably the social science study that most profoundly affected turn-of-the-century American society.

Hoffman conducted a detailed analysis of disease rates among freed slaves and concluded that black people, as a race, were sicker and more disease-prone than whites and were therefore on a downward spiral to extinction. Time has shown that this analysis was flawed, but given the conviction with which he presented the data, it was, at the time, all that was needed to render the entire African-American community effectively uninsurable. As a result, sick African Americans were unable to afford healthcare and got sicker, cruelly converting his flawed report into a self-fulfilling prophecy.

We now know that Hoffman’s mistake was in confusing correlation with causation. Blinded by prejudice, he never stopped to think that it was poverty and injustice, rather than race, that were to blame. But his faulty conclusion had deep consequences, reinforcing a wrong whose damage to an entire community is felt to this day.

The problem with data is that it can be presented in myriad ways: while individual elements of data are immutable, in aggregate a database can be arranged to mean different things in different contexts. The Hoffman report is a telling example of the harm that can be wrought by drawing false conclusions from data sets. As we increase our reliance on data for decision-making, using big data and machine learning to determine the premiums we should pay on our insurance or our eligibility for a job, we would do well to ensure that, in our eagerness to become more scientific, we don’t end up seeing only those patterns that we want to see.

The Crime and Criminal Tracking Network and Systems (CCTNS) is the Indian government’s attempt at implementing a form of predictive policing: using big data to identify potential criminals before they commit their crimes. The government plans to connect the 14,000 police stations across the country in order to facilitate rapid investigation and detection of crime. In doing so it will, for the first time, correlate existing databases of criminals and history-sheeters with geographical information, first information reports and allied data in order to better anticipate criminal activity.

As we start down this path of data-assisted crime prevention, we must be mindful of the historical biases inherent in our criminal databases and ensure that when they are trawled by machines, we don’t allow machines to institutionalize our prejudice.

Take, for instance, the Pardhis, a denotified “criminal” tribe whose members are routinely rounded up by the police whenever there is a crime in the area. Members of the tribe populate criminal databases around the country, sometimes just by virtue of belonging to that tribe. If our computers were to blindly rely on these historical databases, they could, much like Hoffman, reinforce historical bias and force the community into persistent machine-determined discrimination.

One of the often-touted ancillary benefits of digital payments is the extent to which this technology can revolutionise micro-lending. People who have, till now, been unable to provide evidence of credit-worthiness will be able to present a trail of their digital transactions, providing evidence of their ability to repay. But as the machines begin to amass greater volumes of transactional data, they will be able to build more specific personal profiles of our behaviour, allowing them to take nuanced decisions and make subtle discriminations between otherwise similarly situated individuals.

As much as I appreciate the many benefits that data offers us, I remain acutely conscious of the harm that it can cause. The current thinking is that because big data and machine intelligence can solve so many of our most pressing problems, we should focus on the good and not worry about the potential harms. I worry that in rushing blindly after short-term gains we will end up institutionalizing a new data-based caste system that will be that much harder to unravel.