
India's many languages pose a challenge to the development of its large language model
NEW DELHI: India is building its own large language model that it hopes may one day rival OpenAI's chatbot ChatGPT, but the country's countless languages and dialects have made training it a challenge.
India has 22 officially recognised languages and more than 10,000 local languages.
Some languages, like Marathi, share common roots with others such as Hindi and Gujarati, while those spoken in South India, such as Kannada, Telugu, Tamil and Malayalam, belong to the separate Dravidian family and are completely different.
A large language model has to process these multiple languages seamlessly, and coding an AI model capable of understanding most of them, if not all, remains complicated.
TRAINING AI ON LOCAL LANGUAGES
One challenge that BharatGen, a consortium funded by India's government, faces in training its large language model is the lack of online content in Indian languages.
The consortium said that while roughly half of all the data available on the internet is in English, Indian languages make up barely 1 per cent.
Literary works in many Indian languages have never been digitised, while a raft of cultural and traditional information has been verbally passed down for generations without being stored online.
On a more positive note, experts said that the diversity of languages and data collected from local sources could help create AI models with fewer biases.
Ganesh Ramakrishnan, a professor at the Indian Institute of Technology Bombay, told CNA his work involved reaching out to magazines, data sources, foundations and non-governmental organisations that have been gathering data in their local languages.
'(We have been) making it possible to digitise and digitalise and reflect that in the foundational model … so this is a big opportunity,' said Ramakrishnan, who is part of the BharatGen consortium.
EXISTING CHATBOTS ARE INADEQUATE
Some small business owners who have tried using AI in their operations said they have faced language problems with existing chatbots.
Ghooran Yadav, a food cart owner in New Delhi, said that he used ChatGPT to ask about a recipe for the food he sells, but received an underwhelming response.
The app understood his question in the local dialect of Bhojpuri but replied in Hindi.
Yadav said foreign chatbots are not as accurate and that he would prefer a locally made app.
'If it's made in India, it's more likely to give me correct information. Nothing could be better than that,' he added.
EASE OF USE
BharatGen is also aiming to utilise generative AI to solve everyday problems and eventually help deliver services such as providing information about welfare programmes to the people.
An app called Krishi Saathi ('With Farmers' in Hindi), which is powered by BharatGen's Hindi language model, is helping to answer farmers' questions about crop health and pest management.
The app can translate text to local languages. It also allows those who are unable to read or write to communicate by speaking via the app.
'Making sure that the most remotely inaccessible regions also benefit from AI - that is part of the vision here,' said Ramakrishnan.
The AI model can copy a speaker's voice and tone, communicating with the user like an actual person once it has been trained to do so.
BharatGen, one of five major language-based AI projects currently supported by Indian Prime Minister Narendra Modi's government, has already rolled out 19 language models since its inception last year.
Experts said platforms like BharatGen need to invest billions of dollars in graphics processing units and data centres to achieve made-in-India generative AI at scale.
The hefty outlay would be a small price to pay to transform India from a major tech service provider into a major tech disruptor, in what could soon be a trillion-dollar market.
'India is all about scale and complexity,' said Shekar Sivasubramanian, head of the LEHS-AI unit at non-profit AI institute Wadhwani AI.
'If it is solved in India, and if it works in India, chances are, it will work in the world. That's the opportunity.'