3 days ago
Key step in democratising AI: IIT-B releases 16 datasets on AIKOSH
In an important milestone for India's Artificial Intelligence (AI) ecosystem, the Indian Institute of Technology (IIT) Bombay has released 16 diverse and culturally significant datasets on AIKOSH, India's official AI repository, making it one of the biggest contributors to the platform. This marks a crucial step in democratising AI by making high-quality, India-centric data openly accessible to researchers, startups and developers across the country.
IIT Bombay made the announcement on X, saying that these datasets are designed to support innovation and research in AI and Machine Learning (ML), particularly in the Indian context.
*IIT Bombay Releases 16 AI Datasets on AIKOSH: Enabling the Future of Responsible AI in India 🇮🇳*
IIT Bombay is thrilled to announce the release of 16 diverse and culturally significant datasets on AIKOSH, the Government of India's official AI repository. These datasets are…
— IIT Bombay (@iitbombay) May 30, 2025
AIKOSH, which was launched in March by the Ministry of Electronics and Information Technology, is a national platform aimed at providing support for inclusive AI development across the country.
The 16 datasets by IIT Bombay are part of a larger pool of 21 AI models now available on AIKOSH, which were created by BharatGen, a Section 8 company funded by the Department of Science and Technology for indigenous AI development in India. The company is a consortium of seven partners. Led by IIT Bombay, the consortium includes IIT Kanpur, IIT Mandi, IIT Hyderabad, IIT Madras, IIM Indore and IIIT Hyderabad.
Prof Ganesh Ramakrishnan, Department of Computer Science and Engineering, IIT Bombay, said, 'Our goal is not just to build AI models but to provide resources that startups and system integrators can leverage, creating a favourable and sovereign AI ecosystem for India.'
The datasets released on AIKOSH include handwritten and printed Indian scripts, multilingual audio data and resources designed to interpret visual and spoken inputs from Indian environments. Among the notable contributions is a large-scale Sanskrit Optical Character Recognition (OCR) dataset of more than 218,000 sentences from historical texts, intended to support the digitisation of ancient Indian knowledge. There is also a speech recognition dataset with more than 78 hours of Sanskrit audio. Additional resources include capabilities for detecting tables across documents in 14 Indian languages and a comprehensive Wiki on Indian Knowledge Systems, among others.
Prof Ramakrishnan said, 'Equal emphasis on India data and its provenance allows these models to uniquely balance Indian data alongside English data, ensuring true relevance and understanding for our diverse nation, while also catering to its security. These models are built with Indian linguistic and cultural nuances at their core. By making these datasets available to all through AIKOSH, we are democratising AI in order to foster innovations across the country, eventually to build a self-reliant and inclusive AI ecosystem for India.'