"AI is like electricity. Just as electricity transformed every major industry a century ago, AI is now poised to do the same." - Andrew Ng, AI Entrepreneur and Computer Scientist
Public data is fuel
Large Language Models (LLMs) are advancing rapidly thanks to continuous improvements in machine learning and access to vast amounts of internet data. AI firms argue that information freely available online should be usable for training models. However, under copyright law, using this material, even for training, is subject to a license fee and the explicit consent of the original content producer.
Core conflict
This fundamental disagreement has spurred a fierce debate between powerful AI hyperscalers and content producers, including news agencies, book publishers, and entertainment companies. Against this backdrop, the Department for Promotion of Industry and Internal Trade (DPIIT) has released a working paper on the issue. The proposal is seen as a welcome step towards a solution that supports content providers without putting India's domestic AI ecosystem at a disadvantage.
India’s new approach
The DPIIT paper suggests establishing a mandatory framework for data scraping. Under this system, AI developers could scrape publicly available information, while a non-profit copyright society collects payments on behalf of creators. The payments would be levied on AI developers in proportion to the revenue they earn from models trained on Indian content producers' data.
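To make the revenue-linked levy concrete, here is a minimal sketch of how a collecting society could allocate such payments pro rata, assuming a simple revenue-share model. The levy rate, revenue figure, publisher names, and content volumes are purely illustrative assumptions, not values from the DPIIT paper.

def allocate_royalties(ai_revenue, levy_rate, scraped_volume):
    # Split a levy on AI revenue among publishers, pro rata by the
    # volume of their content used in training (e.g. tokens scraped).
    pool = ai_revenue * levy_rate
    total = sum(scraped_volume.values())
    return {pub: pool * vol / total for pub, vol in scraped_volume.items()}

# Hypothetical example: a developer earning Rs 100 crore from models
# trained on Indian content, a 2% levy, and three publishers of very
# different sizes (all numbers are invented for illustration).
payouts = allocate_royalties(
    ai_revenue=1_000_000_000,   # Rs 100 crore, assumed
    levy_rate=0.02,             # assumed levy rate
    scraped_volume={"LargeMediaHouse": 9_000_000,
                    "RegionalDaily": 800_000,
                    "SmallBlog": 200_000},
)
for name, amount in payouts.items():
    print(f"{name}: Rs {amount:,.0f}")

A flat pro-rata split like this also makes the concern raised below visible: a small publisher's payout shrinks with its volume, regardless of the effort behind its work.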
Mandated scraping is problematic
Some view the reasoning behind the proposed system as partially flawed: it rests on the belief that data processing is an inherent right of AI developers, and it overrides existing copyright law. The approach is also detrimental to small publishers, who put significant effort into their work but would receive minimal royalties, while large media houses could profit substantially from the new structure.
Remuneration?
A uniform system of remuneration is urgently needed to prevent continuous litigation between publishers and AI firms. The government should seize the momentum created by the working paper, and the dissent it has drawn from the tech industry, to enact a collaborative, well-designed framework. A poorly designed system could seriously disrupt the media and internet landscape.
Summary
India's DPIIT has proposed a mandatory data scraping framework that would require AI companies to pay royalties to content creators for using their data. While a welcome step, critics argue the system overrides existing copyright law and unfairly disadvantages small publishers, underscoring the urgent need for a fair, collaborative remuneration model.
Food for thought
If the development of large language models relies entirely on the existing corpus of human-generated work, what exactly constitutes innovation in AI and where should the line be drawn between learning and theft?
AI concept to learn: Copyright Debate in AI Training
The copyright debate in AI training centers on whether using copyrighted text, images, audio, and video to train large models constitutes fair use or unauthorized reproduction. Creators argue that AI companies benefit from their work without permission or compensation, while developers claim that training uses data statistically, not as stored copies. Courts worldwide are examining issues like transformative use, market harm, and transparency obligations. Some propose licensing, compensation pools, or opt-out mechanisms. As generative AI expands, balancing innovation with creators’ rights has become one of the most important legal and ethical challenges in the AI ecosystem.
[The Billion Hopes Research Team shares the latest AI updates for learning and awareness. Various sources are used. All copyrights acknowledged. This is not professional, financial, personal, or medical advice. Please consult domain experts before making decisions. Feedback welcome!]
