We Don't Need Large Datasets

Ford’s internal combustion engine car beat Edison’s EV to the market, and as a result we are on our current fossil-fuel-dependent path. What if things had been different? Few-shot learning is an alternative to data-guzzling artificial intelligence models that frees us from our dependence on large datasets.

This article was first published in The Mint. You can read the original at this link.


A hundred and thirty-eight years ago, almost to the day, Thomas Alva Edison switched on the coal-fired power plant at his Pearl Street Station in lower Manhattan and, for the first time in history, provided commercial electricity to 59 homes within a square mile of the plant. Many see this event as the birth of the modern era, as it marks the origin of a dependence on electricity so complete that it is now integral to almost everything we take for granted.

I have written about Edison previously and have always found it hard to disguise my admiration for the sheer breadth of his inventive genius:

It is this perspicacity to see the connections that exist between different inventions and the ability to make great leaps across disciplines that made him the greatest inventor of the modern era.

But the kind of power generation that Edison pioneered on that day in September 1882 put us on a trajectory that has had unfortunate outcomes. He kicked our reliance on fossil fuels into overdrive, allowing it to permeate all aspects of our lives - from the electricity we need to power our homes, offices and factories to the petroleum we need to run our cars, ships and planes. As a result, he forced us down a path of high energy consumption that has rapidly depleted naturally occurring carbon-based fuel sources and inflicted near-irreversible damage on our planet.

Edison’s choice of coal as the fuel source for his power plant should not be taken as a sign that he supported fossil fuels as a source of energy. On the contrary, he was aware, almost prophetically, of the harmful consequences of continuing down this path. “Someday some fellow will invent,” he said, “a way of concentrating and storing sunshine to use instead of this old, absurd Prometheus scheme of fire.”

He believed in the need to harness solar, wind and tidal energy at a time well before the science to do so was feasible.

“Sunshine is a form of energy, and the winds and the tides are manifestations of energy. Do we use them? Oh no; we burn up wood and coal, as renters burn up the front fence for fuel. We live like squatters, not as if we owned the property. There must surely come a time when heat and power will be stored in unlimited quantities in every community, all gathered by natural forces. Electricity ought to be as cheap as oxygen, for it cannot be destroyed.”

Most importantly, he believed that we should make automobiles that run on electricity - not petrol - and to that end had built a vehicle powered by alkaline batteries of his own invention. But although the battery technology he developed went on to power electric trucks, railroad signals and even a submarine, it took him so long to perfect the design of his electric car that it went on sale a full year after his good friend Henry Ford introduced the world to his low-priced, high-mileage Model T. As a result, it is the internal combustion engine that powers the world today and, instead of using renewable energy for our needs, we are stuck with our current gas-guzzling lifestyle.

I cannot help but think how different things might have been had Edison beaten Ford to the market.

We find ourselves at a similar crossroads with artificial intelligence today. The dominant techniques that we currently use - which have delivered advances as miraculous as facial recognition, voice recognition and natural language processing - are voracious in their consumption of data. They depend on the availability of massive training datasets comprising millions of individual elements of structured data, without which it is impossible to train the complex models that we use to identify patterns not immediately evident to human senses.

As a result, leadership in artificial intelligence is widely associated with the ability to access large volumes of structured information of the kind that is presently under the control of only the largest tech companies in the US and China. This has resulted in an arms race for the control of data, with countries around the world asserting their authority over the data of their citizens regardless of where or under whose control it might be. Europe leveraged its data protection regulation to make it next to impossible for data to be transferred beyond its shores, while India demanded that this data be localised and established a committee to look into how non-personal data might be harnessed so that its value accrues to the nation.

But the very fact that these dominant machine learning models are incapable of accuracy unless they have sampled large volumes of data is a sign of their extreme inefficiency. Even a child can identify objects that it has seen just once before. We do not need to have trawled through vast libraries of bird pictures to know one when it flies by. What’s more, because our minds are capable of synthesising and learning new object classes based on existing information about different, previously learned ones, we can identify an object as a bird even if we have never seen that particular species before.

What we need today are machine learning techniques that can achieve a higher level of sample efficiency and transferability than is possible using traditional deep learning techniques. If we can do that, we will be able to shake ourselves free of our dependence on the data-guzzling models of artificial intelligence to which we are currently wedded.

Few-shot learning is an exciting new technique that allows us to train our models on sparse volumes of data. At present it is largely being applied to image classification, retrieval and segmentation, but it is likely to find broader application in natural language processing, drug discovery and other fields where structured data is hard to find. While it is still in its infancy, there are signs that this is the direction in which artificial intelligence is headed.
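To make the idea concrete, here is a minimal sketch of one common flavour of few-shot classification: average the handful of labelled examples available for each class into a single "prototype", then assign a new item to the class whose prototype is nearest. This is an illustration under stated assumptions, not a description of any particular system. The two-dimensional feature vectors and the function names build_prototypes and classify are made up for the example; in practice the vectors would typically come from a model already trained on some other task.

```python
import math
from collections import defaultdict


def build_prototypes(support):
    """Average the few labelled examples of each class into one prototype vector."""
    grouped = defaultdict(list)
    for label, vector in support:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is closest (Euclidean distance) to the query."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D feature vectors, invented for illustration: two examples per class.
support_set = [
    ("bird", [0.9, 0.1]),
    ("bird", [0.8, 0.2]),
    ("fish", [0.1, 0.9]),
    ("fish", [0.2, 0.8]),
]

prototypes = build_prototypes(support_set)
prediction = classify([0.7, 0.3], prototypes)  # a new, unlabelled item
```

With only two examples per class, the new item lands nearer the "bird" prototype than the "fish" one. Real few-shot systems are far more elaborate, but the principle of generalising from a handful of examples is the same.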

If this is, in fact, the case, there is an argument to be made for completely rethinking our current approach to data. If achieving a leadership position in artificial intelligence no longer means amassing vast stores of structured data, would we not be better off focussing our legislative energies on encouraging the use of these new and more efficient methods of machine learning?

Edison knew that electric transportation was the way to go. He even gave us the prototype of an electric vehicle before cars powered by internal combustion engines had become commonplace. Had we chosen electric, we could have avoided the gas-guzzling century that followed.

Today we have to make a similar choice - between the ravenous, data-guzzling techniques behind the artificial intelligence we know and the few-shot learning models that are just coming into their own. I hope that, at least this time, we make the right choice.