11-03-2025
MBZUAI develops Kazakh LLM in collaboration with Inception
AI has the potential to be the greatest equaliser in multilingual societies, breaking down language barriers, fostering inclusivity, and amplifying cultural identities on a global scale.
Recognising this potential, Inception, a technology firm, recently released SHERKALA, a Large Language Model in the Kazakh language. Developed in collaboration with Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Cerebras, the model has been rigorously evaluated against human-curated Kazakh-specific benchmarks, covering culture, geography, and history, ensuring high accuracy in comprehension and generation. 'SHERKALA serves as a scalable and localized AI solution, demonstrating how AI can be leveraged to support linguistic diversity and expand access to technology,' said Dr. Larry Murray, VP of Applied Science, Inception.
Excerpts from an interview:
What inspired the decision to launch a Large Language Model in the Kazakh language?
The development of SHERKALA stems from a commitment to AI inclusivity and the need to address the linguistic and technological gaps faced by underrepresented languages. Although Kazakh has more than 13 million speakers, it has historically been underrepresented in large-scale AI models, limiting access to high-quality AI-driven applications. While global LLMs have advanced significantly, they often fail to capture the linguistic, cultural, and contextual nuances of Kazakh, creating a gap in AI accessibility.
SHERKALA bridges this gap through state-of-the-art linguistic adaptation, trained on 45 billion words, primarily focusing on Kazakh while incorporating English, Russian, and Turkish.
How does SHERKALA align with Inception's broader AI strategy, especially in supporting underrepresented languages?
At Inception, our vision is to develop AI-native solutions that drive accessibility, equity, and innovation across global linguistic landscapes. SHERKALA is an extension of this vision, reinforcing our commitment to ensuring that no language is left behind in the AI transformation. It joins our portfolio of JAIS for Arabic and NANDA for Hindi, taking another step toward reshaping the global AI ecosystem.
What sets SHERKALA apart is its continuous pretraining from Llama 3.1 with a 25% expanded tokenizer, making Kazakh language processing as efficient as English in top-tier LLMs. This advancement enables greater token efficiency, precise word formation, and natural linguistic flow, overcoming longstanding challenges in Kazakh NLP. By leveraging cutting-edge research, scalable infrastructure, and strategic partnerships, we are systematically addressing the AI divide and reshaping how technology interacts with language.
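To make the tokenizer point concrete, here is a minimal sketch of how token efficiency is often measured, as "fertility" (average tokens per word). The interview does not say which metric Inception used, so this is an assumption. The two vocabularies and the greedy longest-match tokenizer below are toy, hypothetical stand-ins, not SHERKALA's or Llama 3.1's actual tokenizer. The sketch only illustrates why adding Kazakh subword units to a vocabulary can cut the number of tokens needed for an agglutinative Kazakh word.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest pieces found in vocab, left to right.

    Any character not covered by a vocab piece becomes its own token,
    which mimics a character-level fallback (a simplifying assumption).
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    total_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return total_tokens / len(words)


# Hypothetical morphemes: book + plural + "my" + locative = "in my books".
KAZAKH_MORPHEMES = ["кітап", "тар", "ым", "да"]
kazakh_text = "".join(KAZAKH_MORPHEMES)
english_text = "in my books"

# Toy vocabularies (assumptions): the base one knows only English pieces,
# the expanded one also contains the Kazakh morphemes.
BASE_VOCAB = {"in", "my", "book", "s"}
EXPANDED_VOCAB = BASE_VOCAB | set(KAZAKH_MORPHEMES)

print("English, base vocab:   ", fertility(english_text, BASE_VOCAB))
print("Kazakh, base vocab:    ", fertility(kazakh_text, BASE_VOCAB))
print("Kazakh, expanded vocab:", fertility(kazakh_text, EXPANDED_VOCAB))
```

In this toy setup the base vocabulary falls back to one token per character for the Kazakh word, while the expanded vocabulary covers it with one token per morpheme. Real tokenizers are trained on data rather than hand-built, so actual savings depend on the corpus and the vocabulary.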
How does SHERKALA's performance compare to larger models, such as the 70-billion-parameter LLMs? What sets it apart in terms of efficiency and accuracy?
SHERKALA has been trained on a meticulously curated dataset and optimized to perform comparably to larger models on key measures, proving that scale alone does not determine capability. Unlike most Kazakh-centric or multilingual models, SHERKALA has been rigorously benchmarked against human-curated Kazakh-specific tests, ensuring superior comprehension and contextual intelligence. Trained on Condor Galaxy, one of the world's most advanced AI supercomputers, SHERKALA achieves exceptional computational efficiency while maintaining high precision in both training and inference. With evaluations validating its superior generation quality, SHERKALA sets a new benchmark for open-source Kazakh LLMs by delivering fluid, contextually relevant, and culturally precise responses.
Which industries or sectors in Kazakhstan do you anticipate will benefit most from SHERKALA's capabilities?
SHERKALA offers AI-driven solutions that can support various sectors in Kazakhstan by improving access to technology in the Kazakh language. In education, it can assist with digital learning, automated tutoring, and translation support for students and educators. Government and public services may benefit from improved multilingual communication, policy drafting, and citizen engagement. Furthermore, in finance and legal sectors, SHERKALA can aid in document processing, contract analysis, and compliance automation. As AI adoption grows, SHERKALA provides a foundation for enhancing accessibility and efficiency across industries that rely on language processing and communication.
With SHERKALA joining JAIS and NANDA in your portfolio, what are your next steps for developing AI models for other linguistic communities?
Our journey to make AI accessible to all has only just begun. Following the success of JAIS and NANDA, the launch of SHERKALA has reinforced the need for more language-specific AI solutions, particularly for regions where digital infrastructure is evolving. Our next steps involve expanding our AI model portfolio to support additional languages that remain underserved in the AI ecosystem. We are exploring opportunities in Central Asia, Africa, and Southeast Asia, where linguistic diversity is rich but AI representation remains limited. Moreover, we are focused on refining our models to enhance multilingual interoperability, enabling seamless AI interactions across different languages while maintaining linguistic integrity. Our partnerships with Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and technology leaders will continue to drive this vision forward, ensuring that AI becomes a universally accessible tool for every language and culture.
How do you foresee AI's role in multilingual societies, and what steps should governments and businesses take to adopt AI in a responsible and inclusive manner?
AI has the potential to be the greatest equaliser in multilingual societies, breaking down language barriers, fostering inclusivity, and amplifying cultural identities on a global scale. However, realizing this vision requires a structured and ethical approach from both governments and businesses. First, governments must champion open-source AI development while ensuring transparency, security, and fairness in language AI models. Investment in AI education and research will be crucial to fostering homegrown AI talent that understands local linguistic needs. Businesses, on the other hand, must adopt AI-driven solutions that are both inclusive and responsible, ensuring that AI does not reinforce biases or marginalize certain linguistic groups. Collaboration between the public and private sectors will be key in setting ethical AI frameworks, promoting responsible data practices, and ensuring that AI serves as an enabler of progress rather than a disruptor of linguistic diversity. SHERKALA is a prime example of how AI can be harnessed responsibly to empower communities and create a more connected, inclusive future.