“Languages are the wealth of our culture, and AI must learn to speak them all.” – Sundar Pichai, CEO, Google
The great Indic data hunt
India’s AI race has entered a crucial phase as the government-backed IndiaAI Mission pushes for indigenous language models. With an allocation of ₹10,000 crore, the mission aims to make AI accessible across India’s linguistic diversity. But the biggest hurdle remains: there is not enough high-quality Indic language data to train these models efficiently.
Building AI for Bharat
Startups like Sarvam AI and Soket Labs are leading the charge, building foundational models similar to OpenAI’s GPT but tailored for Indian languages. Sarvam’s 120-billion-parameter model and Soket’s open-source 120-billion-parameter model aim to power government, education, and healthcare applications. Gan.ai and Gnani.ai, meanwhile, focus on speech and multimodal tools to bring AI closer to everyday users.
The data scarcity challenge
While OpenAI and Anthropic rely on massive global datasets, Indian firms must find creative solutions. Initiatives like AI4Bharat and Bhāṣini have begun compiling text and speech data across 22 Indian languages, but the scale is still limited. These datasets are essential to bridge India’s language gap in the global AI race.
Innovation at a fraction of global cost
Training a foundational model in the West costs billions, but Indian startups are innovating on lean budgets. By combining crowdsourced data, multilingual corpora, and targeted datasets from agriculture, education, and healthcare, India’s AI ecosystem is crafting frugal yet powerful models.
The road ahead
India’s AI journey depends on solving the Indic data puzzle. As startups, researchers, and policymakers collaborate, the vision is clear: to make AI not just intelligent, but inclusive and multilingual.
