Cut through clutter: Tamil dataset to train AI models


Times of India, 30-06-2025
Netizens engaging with AI models in Tamil or any other regional language often come across incoherent translations, jumbled sentences, bizarre word choices and poor grammar, but overlook them as the babble of a budding ecosystem.
But not Raju Kandaswamy, a senior IT professional, who believes such errors throw a spanner in the works of Tamil's linguistic integrity.
"Current training datasets are heavily distributed in English language and therefore do not accurately represent Tamil language or its cultural context. Users over a period of time absorb and internalise these biases leading to slow erosion of cultural values," he said. An increasing number of people, including senior citizens, rarely pick up books and consume Tamil content only through the internet or through speech.
In a world increasingly mediated by large language models (LLMs), which now shape web searches, shopping and education, this creates a problem.
Kandaswamy is principal consultant at Thoughtworks in Coimbatore and part of AI Tamil Nadu, a non-profit community aiming to improve how AI models work in Tamil. The team is building a large-scale Tamil language dataset to train AI models, and is collaborating with authors and other organisations to curate large, high-quality datasets.
Their plan is to use these data repositories to fine-tune open-source models such as Meta's Llama and make them available for anyone to build Tamil-specific models.
He believes that these models can be used to deliver govt services, communicate welfare schemes, and enable vernacular education to the masses, especially the rural population. Abinaya Mahendiran, a natural language processing expert and member of AI Tamil Nadu, is leading the initiative named Vidhai.
She too thinks it is crucial for preserving Tamil culture.
"Access to high-quality datasets in less represented languages is limited, and Tamil is no exception. Machine-translated content is often inaccurate. So, we collect original Tamil texts such as books, essays and articles from various sources, clean them, and annotate them with the help of volunteers, students, linguists, retirees, and teachers. A trove of Tamil books and printed material is yet to be digitised," she said.
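The cleaning step she describes could look something like the sketch below. It is only an illustration under stated assumptions, not the Vidhai pipeline: the function names `tamil_ratio` and `clean_corpus` and the 0.8 threshold are hypothetical. It normalises each line to Unicode NFC, collapses whitespace, drops exact duplicates, and discards lines that are mostly not in Tamil script.

```python
import unicodedata

# Unicode Tamil block (U+0B80 to U+0BFF).
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def tamil_ratio(text):
    """Fraction of letters and combining marks that fall in the Tamil block."""
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    tamil = sum(TAMIL_START <= ord(ch) <= TAMIL_END for ch in relevant)
    return tamil / len(relevant)


def clean_corpus(lines, min_ratio=0.8):
    """Normalise, de-duplicate and filter lines (min_ratio is an assumed cut-off)."""
    seen = set()
    cleaned = []
    for line in lines:
        # NFC makes visually identical strings compare equal; split/join
        # collapses runs of whitespace and trims the ends.
        text = " ".join(unicodedata.normalize("NFC", line).split())
        if not text or text in seen:
            continue
        if tamil_ratio(text) < min_ratio:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Annotation by volunteers and linguists would happen after a pass like this; real pipelines would also need near-duplicate detection and quality checks that this sketch leaves out.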
Today, many independent researchers and language enthusiasts are spending their own money to improve Tamil AI models. But as Abinaya notes, lack of computing resources and difficulty in mobilising volunteers are major barriers.
The Tamil Virtual Academy (TVA) has a digital library of more than 1 lakh (100,000) books containing around 1.5 crore (15 million) pages, spanning subjects from science to history. It is also developing tools such as syntactic parsers, morphological analysers, and part-of-speech taggers, resources critical for NLP research.
Yet, fragmented efforts, siloed developments, and fuzzy copyright guidelines hinder collaboration. A senior official confirmed that TVA could collaborate with AI technologists, but ambiguity around copyright and fair use remains a bottleneck.
Navaneeth Malingan, founder of AI Tamil Nadu, is attempting to bridge the ecosystem by bringing together various elements, from scouting for student volunteers and linguists to getting access to computing resources through corporate sponsorship.
He says these kinds of models are crucial for delivery of govt services for locals, while commercial AI models will be useful for most business cases. "The govt can use it to fill forms through voice, give instructions to farmers and teach Tamil to the younger generation.
Various stakeholders including govt and companies should be brought together to build these models suitable for the use cases," he asserted.
The community is currently fine-tuning existing AI models to improve their performance in Tamil, but is ambitious about building one from scratch, albeit a small, domain-focused model.
It will use a tokenisation method inspired by Nannul, the 13th-century Tamil grammar treatise, to better reflect the language's morphological structure instead of the widely used Byte-Pair Encoding (BPE) method. Tokenisation refers to the process of breaking down text into smaller units called tokens.
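To see why the starting units matter, consider the word தமிழ் ("Tamil"). The sketch below is not the Nannul-inspired tokeniser, whose details the community has not described here, and `split_units` is a hypothetical helper. It simply compares three ways of cutting the same word: UTF-8 bytes (where a byte-level BPE tokeniser starts before learning merges), Unicode code points, and written letters formed by attaching vowel signs and the pulli to their base consonant.

```python
import unicodedata


def split_units(text):
    """Attach combining marks (vowel signs, pulli) to the preceding base character."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word "Tamil" in Tamil script, written with escapes to make the
# code points explicit: TA, MA, vowel sign I, LLLA, pulli.
WORD = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

utf8_bytes = list(WORD.encode("utf-8"))  # 3 bytes per Tamil code point
code_points = list(WORD)
letter_units = split_units(WORD)

print(len(utf8_bytes), len(code_points), len(letter_units))  # prints: 15 5 3
```

The same five-code-point word is 15 units at the byte level but only three written letters, which gives a rough sense of how a script-aware starting point can keep tokens closer to the units a Tamil reader would recognise.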
From adopting the printing press in the 1500s (the first in India), to adopting Unicode for the internet, Tamil has consistently been an early adopter of new communication technologies. Now, it should be able to find its place in the AI age.
