Balancing Big Data and privacy

August 01, 2018

The Justice Srikrishna Committee’s data protection framework aims to balance individual privacy with the growth of the digital economy, distinct from models in the US, EU, and China. But the committee missed opportunities to encourage de-identified data use and set impractical standards for anonymization. Concerns arise from the draft law’s definition of harm, potentially hindering AI and machine learning applications in social contexts by categorizing service denial based on evaluative decisions as harmful, which could restrict beneficial financial and social inclusion technologies.

This article was first published in The Mint.

One of the most exciting promises that the Justice Srikrishna Committee held out was that the data protection framework it suggested would protect individual privacy while ensuring that the digital economy flourished. It claimed that in doing so it would chart a path distinct from the US, the European Union and China, one that was finely tuned to the new digital economy. If it was going to deliver on this, its biggest challenge was going to be designing its privacy framework to address both the promises and challenges of Artificial Intelligence and Big Data.

As I read through the report, I was glad to note that the committee had devoted considerable space to the subject. While discussing the principles of collection and purpose limitation, the committee observed that the purposes for which Big Data applications use data only become evident at a later point and that it is, therefore, impossible to stipulate a purpose in advance. As a result, the committee noted that “limiting collection is antithetical to large-scale processing; equally, meaningful purpose specification is impossible with the purposes themselves constantly evolving”. This is the most succinct analysis of the privacy issue central to the regulation of Big Data technologies that I have read. It gave me hope that the report would articulate a solution that achieved this fine balance.

However, other than vaguely suggesting that personal data should be processed in a manner that does not result in a decision being taken about an individual and, where it does result in such a decision, that explicit consent should first be obtained, the report does not provide any new or innovative solution to the concerns that it so eloquently articulated. The accompanying draft Personal Data Protection Bill, 2018 retains the principles of collection and purpose limitation, departing not a whit from the formulation commonly found in most data protection legislation. Despite recognizing the many benefits of Big Data and the need to encourage its growth, the committee has offered no useful suggestions as to what should be done.

I had hoped that the committee would encourage the use of de-identified data sets by suggesting that companies that design their systems to de-identify data would be exempted from some of the provisions of the law. This would have encouraged organizations to incorporate privacy into the design of their systems from the ground up. At the same time, it would have generated valuable data sets that could be of use in Big Data applications. Instead, the committee seems to have gotten itself so mired in concerns around the possibility of re-identification that it has exempted from the law only data that has been irreversibly de-identified.

I am sceptical as to whether there can ever be such a thing as completely irreversible anonymization. Experience has shown that machine-learning algorithms are able to derive personal insights from even the most thoroughly anonymized data sets. Instead of prescribing an impossible standard, the committee would have done well to place the onus of ensuring anonymity on the entities responsible for maintaining these anonymized data sets, allowing them exemptions from their privacy obligations only if they could demonstrate that their use of these data sets does not compromise the identity of any individual. Should technology evolve to the point where it is capable of re-identifying individuals in these data sets, it will be the responsibility of those entities to upgrade their solutions to ensure that anonymity is maintained despite these new advances. As an added advantage, if the individuals in these anonymous data sets want, they can consent to being re-identified to partake of the benefits that being part of that data set offers them.

But these are all examples of what the committee could have done. Of genuine concern are the things the draft actually contains that could retard development. Primary among these is the definition of harm and, in particular, one of its sub-categories: the “denial or withdrawal of a service, benefit or good resulting from an evaluative decision about the data principal”. If harm is defined in this manner, it could well have a deleterious impact on everyone using machine learning and Artificial Intelligence (AI) in the social context.

One of the primary uses of machine learning is to discover new and valuable insights that remain hidden from us when we rely on ordinary human intelligence. Using these techniques, flow-based lending platforms have been able to bring thousands of people into the banking system, offering them loans and other financial products for which they were otherwise ineligible. Do these processes take evaluative decisions that might deny someone a service while offering it to someone else? Of course. But so long as the result is not unfair, no harm is done. On the contrary, huge swathes of people who were hitherto unable to access the financial markets now have a chance.

We have to ensure that the algorithms do not discriminate unfairly against anyone. But, to declare, as the draft law seems to have done, that every denial of service based on an evaluative decision is harmful is tantamount to throwing the baby out with the bathwater.