Skinny Solstice

January 10, 2018

There is value in large datasets that reveal statistically relevant insights, such as trends in fast food visits or sleep patterns. This matters for India’s growing data economy, given the potential to use aggregated anonymous data to make informed decisions. We need to implement privacy by design, and for that, organizations need to de-identify personal information to prevent potential harm if the data is lost or stolen.

This article was first published in The Mint. You can read the original at this link.

Last Monday was Skinny Solstice. According to Foursquare, a search-and-discovery mobile app, it is the day of the year that sees the lowest number of visits to fast food restaurants. This year it also seemed to be the day everyone decided to take up a gym membership, because my gym was absolutely packed. It isn’t hard to figure out what was happening. The first Monday after New Year’s Day is the day most people get back from the holidays, and it is the first chance they get to put their New Year’s resolutions into action. After a guilt-ridden holiday season of feasting and relaxation, fitness is probably the first thing they think of. In a month’s time, on the second Friday in February, it will be Fatty Solstice, the day which, according to Foursquare, has the highest number of fast food visits of any day in the first three months of the year. That seems to be about as long as it takes us to realize that it’s too hard to stick to good intentions.

Although this information seems trivial, we might never have come to know of it had it not been for global location-based services like Foursquare that have, today, such a vast amount of check-in data across so many years that they are able to make statistically relevant statements like this. For years now, a number of similar services have been quietly collecting data from us to the point where these massive databases have finally reached the size and spanned the duration that makes their insights statistically relevant.

Fitbit recently released 6 billion hours' worth of customer sleep data collected over the course of 2017, revealing some remarkable facts about the ways in which we sleep. It tells us that, in general, women sleep 25 minutes more than men but are 40% more likely to suffer from insomnia, and that people typically get 30 minutes more sleep at age 20 than at age 70. The statistic I found most interesting was the correlation between quality of sleep and sleep-time consistency: apparently, people who consistently go to sleep at the same time every day (no matter when that is) tend to sleep better than those who do not.

Now that India is rapidly embracing the data economy, information of this sort is starting to become available here too. Last week, I had the opportunity to look through some of the data trends in Aadhaar-enabled payments taking place throughout the country. I was surprised to find that one of the largest corridors of money transfers was between Mumbai and Bihar. When I thought a bit about it, I realized that this probably had to do with the large population of migrants from Bihar who live and work in Mumbai and regularly send money back home to their families, using the digital payment bridge as a substitute for the expensive hawala couriers they had to rely on previously. These insights help banking correspondents and financial services companies position themselves in appropriate locations so as to better extend their coverage to the population that requires it.

This is the true value of data. It is in the large numbers, in these big datasets of aggregated anonymous data, that we will find truly useful patterns with which we can arrive at non-intuitive realizations that power our decisions. Most organizations only want to use data for this purpose. That said, all large datasets are made up of many individual units of personal data, and if any of that data gets lost or stolen it could cause substantial harm to the individual to whom it relates. If all that organizations need is big data, and if they don’t really care about individual information, then they must design their systems to de-identify the information as soon as it is collected.

This is one of the foundations of privacy by design, a principle that was first popularised by the Information and Privacy Commissioner of Ontario, Canada, and which the Justice Srikrishna Committee would do well to consider in its recommendations. Today, it is virtually impossible to prevent data being collected from us. What the law can do, instead, is stipulate that anyone who collects data or who controls it has the obligation to immediately render it harmless by separating it from the information that allows it to be linked to a specific person.

De-identified data retains all the properties that make it useful from a big data standpoint but ensures that, if it is lost or falls into the wrong hands, it is unlikely to cause harm.
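To make the idea concrete, here is a minimal sketch in Python of what de-identifying at the point of collection might look like. Everything in it is an assumption made for illustration: the field names (`name`, `phone`, `from_region`, `to_region`, `amount`), the helper functions `deidentify` and `transfers_by_corridor`, the key, and the sample records are all invented, and replacing an identifier with a keyed hash is only one of several possible approaches.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical field names that directly identify a person.
DIRECT_IDENTIFIERS = {"name", "phone"}


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with direct identifiers removed.

    A keyed hash (HMAC-SHA256) of the phone number is kept as a pseudonym,
    so records from the same person can still be grouped without storing
    the phone number itself.
    """
    pseudonym = hmac.new(
        secret_key, record["phone"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["person_id"] = pseudonym
    return cleaned


def transfers_by_corridor(records: list) -> Counter:
    """Count transfers per (origin, destination) pair."""
    return Counter((r["from_region"], r["to_region"]) for r in records)


# Invented sample records, for illustration only.
raw_records = [
    {"name": "Person A", "phone": "0000000001",
     "from_region": "Mumbai", "to_region": "Bihar", "amount": 500},
    {"name": "Person A", "phone": "0000000001",
     "from_region": "Mumbai", "to_region": "Bihar", "amount": 700},
    {"name": "Person B", "phone": "0000000002",
     "from_region": "Delhi", "to_region": "Bihar", "amount": 300},
]

# Assumed to be stored separately from the de-identified data.
secret_key = b"example-key-kept-elsewhere"

clean_records = [deidentify(r, secret_key) for r in raw_records]
corridors = transfers_by_corridor(clean_records)
```

The aggregate pattern (which corridors carry the most transfers) survives, while names and phone numbers never reach the stored dataset. A keyed hash is pseudonymisation rather than full anonymisation: the fields that remain can sometimes be combined to re-identify someone, so a real system would likely need further steps such as coarsening or aggregating those fields.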

These sorts of measures should allow us to get the most out of data while retaining appropriate safeguards.