DeepSeek’s rise spurs debate on data ethics
The emergence of the ChatGPT-like Chinese AI model DeepSeek wiped out about $1 trillion in market cap within a few hours last week, shaking up the powerful US tech giants. Reportedly, DeepSeek was built at low cost, using a small fraction of the computing power employed by OpenAI. Analysts have called DeepSeek's launch AI's Sputnik moment, a reference to the space race between the US and the Soviet Union triggered by the 1957 launch of the Soviet satellite Sputnik.
Clearly, AI's shroud of mystery has been torn apart by DeepSeek. There have been signs of American resistance, too. For example, citing "potential security and ethical concerns", the US Navy reportedly forbade its members from using DeepSeek.
That's not all. Some US tech leaders have attempted to change the narrative by portraying DeepSeek as the villain. OpenAI and Microsoft, its largest investor, are investigating whether DeepSeek violated OpenAI's terms of service by using its intellectual property to create its competitor. "We take aggressive, proactive countermeasures to protect our technology," a representative for OpenAI has stated.
The US administration, too, has reacted. David Sacks, President Donald Trump's AI and crypto czar, has said there is substantial evidence that DeepSeek "distilled" knowledge out of OpenAI's models. Leading US AI companies are likely to take "steps to try and prevent distillation" over the coming months in an attempt to impede alleged copycat models.
So, what is "distillation"? It is a technique developers use to train smaller AI models on the outputs of larger, more complex ones, and it is not, strictly speaking, stealing. Sacks explained distillation to Fox News thus: when one model learns from another model, effectively "the student model asks the parent model a lot of questions, just like a human would learn, but AIs can do this asking millions of questions, and they can essentially mimic the reasoning process they learn from the parent model and they can kind of suck the knowledge of the parent model."
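The mechanism Sacks describes can be sketched in a few lines. The following is a hypothetical illustration, not anyone's actual pipeline: `teacher_answer` is a stand-in for a call to a larger model's API, and the prompts and answers are invented for the example.

```python
def teacher_answer(prompt: str) -> str:
    # Stand-in for the "parent" model: a real system would call
    # the larger model's API here and record its response.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know.")

# The "student" asks the parent many questions...
prompts = ["What is 2 + 2?", "Capital of France?"]

# ...and each (question, parent answer) pair becomes a supervised
# training example for fine-tuning the smaller student model.
training_data = [(p, teacher_answer(p)) for p in prompts]
```

At scale, the same loop runs over millions of prompts, which is why providers' terms of service, rather than copyright law alone, become the battleground.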
It is a well-established technique in AI research and is frequently used to improve the accuracy of smaller large language models (LLMs). The procedure has become so common in deep learning that Nobel laureate Geoffrey Hinton co-authored an oft-cited paper about it. In particular, the 2015 paper "Distilling the Knowledge in a Neural Network", by Hinton and two other Google scientists, makes the case that distillation can increase the efficiency of neural networks and that it "works very well for transferring knowledge from an ensemble or from a large, highly regularised model into a smaller, distilled model."
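The core idea of that paper is training the student on the teacher's "softened" probability distribution rather than on hard labels alone. A minimal sketch, with invented logits and using only the standard library:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by a temperature before normalising; higher
    # temperatures yield a softer, more informative distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and
    # the student's: minimising this makes the student mimic the
    # teacher's relative preferences, not just its top answer.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [6.0, 2.0, 1.0]   # a confident teacher's logits (invented)
student = [3.0, 1.5, 1.2]   # the student's current logits (invented)

hard = softmax(teacher, temperature=1.0)
soft = softmax(teacher, temperature=4.0)
# Softening raises the probability mass on non-top classes, exposing
# what Hinton et al. call the teacher's "dark knowledge".
```

In practice this loss is combined with an ordinary loss on ground-truth labels, but the softened term is what transfers the larger model's knowledge.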
Some tech leaders have pointed out that distillation is a standard procedure in AI and most of the innovations and concepts in LLM are "borrowed." Also, distillation "has become an important tool in the democratisation of generative AI," according to an IBM article. In fact, "the main beef Silicon Valley has is that China's chatbot is democratising the technology," Kenan Malik has written in a recent piece in The Guardian.
A company, however, may face legal issues if it uses data from proprietary technologies, even though open-source licences frequently permit such use; the legal position is not yet settled everywhere. Zack Kass, an AI consultant and former head of go-to-market at OpenAI, notes that there is a fine line between "distillation" and "extraction".
Also, it seems that many were surprised by OpenAI's claim that DeepSeek practically stole its data. It rekindled the discussion about what, in the context of AI, qualifies as "stealing". CNN's Allison Morrow, for example, wrote: "OpenAI, a startup that's built on a foundation of data it scraped from the internet without permission, is pointing the finger at another startup allegedly doing... more or less the same thing."
Jason Koebler's article on the news website 404 Media, titled "OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole from Us", captures the irony of the situation.
In fact, The New York Times and many other content creators have sued OpenAI, claiming that the company trained its LLMs on copyrighted material. However, OpenAI contends that its actions should be permitted under "fair use" because training GenAI models with copyrighted content "serves a new 'transformative' purpose."
Additionally, as per a Times report, OpenAI has transcribed YouTube videos using speech recognition technology, producing fresh conversational material to improve the intelligence of an AI system. Over the past few years, a number of authors, including John Grisham, Jonathan Franzen and George R R Martin, have sued various GenAI companies, alleging "systematic theft on a mass scale." Margaret Atwood and Philip Pullman have signed an open letter demanding payment for the use of their works by AI companies. Visual artists have sued Stability AI, Midjourney and DeviantArt for copyright infringement. Stability AI was also sued by Getty Images. The Universal Music Group urged Spotify and Apple Music to refrain from letting its content be used to train AI algorithms to create new songs. And so on.
Recently, New Delhi-based Federation of Indian Publishers filed a case in the Delhi High Court, seeking to "stop (OpenAI) from accessing our copyright content." Publishers like Bloomsbury, Penguin Random House, Cambridge University Press, Pan Macmillan, Rupa Publications and S Chand and Co were among the federation’s members that brought the case.
However, as India aspires to have its own DeepSeek moment soon, how will it handle the distillation issue? Union IT Minister Ashwini Vaishnaw has said India would host DeepSeek on its own servers to resolve privacy concerns. He stressed the significance of distillation in AI models for keeping India's AI development open and application-driven, and said accessible computing capacity is crucial for startups, academics and universities, calling it "the most important part of the mission."
In fact, beyond prompting US tech giants to try to reclaim their lost market share, AI's Sputnik moment has spurred a wider discussion about data ownership, and it may eventually lead to a reasonable global outlook, and regulations, on AI's use of data, including "distillation". If that happens, the DeepSeek moment will have had a favourable outcome.