[Inside K-AI] How benchmarks shape the AI battlefield -- and where Korea's models stand
Standardized tests offer a reality check, separating marketing buzz from genuine AI performance
The race for sovereign AI is intensifying, with countries rushing to build their own large language models to secure technological independence. Korea is no exception -- the government has tapped five leading companies to spearhead the creation of homegrown models tailored to national priorities. Against this high-stakes backdrop, The Korea Herald launches a special series exploring Korea's AI industry, its standing in the global arena and the rise of Korean-language-focused systems. This first installment looks at benchmarks -- the scorecards of the AI world -- and how Korean models measure up on the tests that are shaping the race. -- Ed.
AI has swept across the tech industry, powering chatbots, search engines and productivity tools. OpenAI's ChatGPT -- which first ignited the global buzz in November 2022 -- and other big tech models sit firmly in the top tier, but the surge of large language models shows no sign of slowing.
Each new arrival is touted as the smartest or the first of its kind, outscoring the rest. That raises a key question: how are these models really evaluated, and which is the true leader?
The answer lies in benchmarks -- the standardized tests that have become the AI world's scoreboard, where companies race to climb the rankings and prove their worth.
In July, South Korea's Upstage pulled off an unexpected breakthrough when its 31-billion-parameter Solar Pro 2 became the only Korean model listed as a "frontier model" by UK-based benchmarking platform Artificial Analysis. It ranked just outside the global top 10 for intelligence and placed first in Intelligence vs. Cost to Run, a measure of how much capability a model delivers for its operating cost.
The result prompted swift reaction from Elon Musk, whose AI company xAI is also a relative newcomer battling entrenched leaders. In a post on X, he insisted his Grok 4 model "remains No. 1" and is "rapidly improving" -- a pointed defense that reflects how sensitive and strategic leaderboard positions have become in the global AI race.
Launching its latest GPT-5 model last week, OpenAI also promoted it as "much smarter" than earlier ones and cited its scores on several key benchmarks measuring performance in areas such as math, coding and visual perception.
"For engineers, benchmarks serve as a barometer for how the LLM they developed fares in the global competition, and as a compass for its future development," an official of an LLM startup said.
Constant race to set new records
Much like human IQ tests or university entrance exams, benchmarks offer a structured way to measure various capabilities, from language comprehension and reasoning to code generation, under the same conditions. When an LLM tops a benchmark, it is deemed state-of-the-art, or SOTA, for that task -- a title that can change hands quickly as new models are released.
MMLU, one of the most widely used benchmarks, poses more than 15,000 multiple-choice questions across 57 subjects. HumanEval and LiveCodeBench test coding ability, while AIME and MATH-500 gauge mathematical reasoning.
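In practice, scoring such a test is straightforward: the model answers each question and the benchmark reports the share it got right. The sketch below illustrates the idea for a multiple-choice test; the query_model function and the sample item are hypothetical stand-ins, not part of any actual benchmark harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# query_model() is a hypothetical placeholder for a real LLM API call.

questions = [
    {
        "prompt": "Which planet is closest to the sun?",
        "choices": ["A) Venus", "B) Mercury", "C) Mars", "D) Earth"],
        "answer": "B",
    },
    # ... MMLU poses more than 15,000 such items across 57 subjects
]

def query_model(prompt: str, choices: list[str]) -> str:
    """Placeholder: a real harness would call the model under test here
    and parse its reply down to a single letter A-D."""
    return "B"  # dummy response so the sketch runs end to end

def accuracy(items: list[dict]) -> float:
    correct = sum(
        query_model(q["prompt"], q["choices"]) == q["answer"] for q in items
    )
    return correct / len(items)

print(f"Score: {accuracy(questions):.1%}")  # reported as a percentage
```

Real harnesses add controls such as fixed prompt templates and few-shot examples so that every model is tested under the same conditions.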
For instance, OpenAI boasted that its new GPT-5 achieved SOTA in math, scoring 94.6 percent on AIME 2025 without tools; in real-world coding, scoring 74.9 percent on SWE-bench Verified; and in multimodal understanding, achieving 84.2 percent on MMMU, among others.
Korean LLM firms are also racing to set new records. Releasing its latest model, Exaone 4.0, on July 15, LG AI Research promoted its strong performance on advanced benchmarks. On MMLU-Pro, the 32-billion-parameter model scored 81.8 percent, ahead of Microsoft's Phi-4-reasoning-plus at 76 percent and Mistral's Magistral Small-2506 at 73.4 percent. On AIME 2025, it also outperformed those rivals with a score of 85.3 percent.
As LLMs advance rapidly, the benchmarks themselves are also evolving. MMLU now offers a Pro edition with more complex reasoning questions. In January, a coalition of 1,000 experts launched Humanity's Last Exam -- a 2,500-question test spanning subjects from classical literature to quantum chemistry.
But what often confuses the public is the endless list of scores. Experts note that because LLMs can do so many different things, each has its own strengths -- making it difficult to declare one model "the best" based on a single benchmark.
To make sense of the growing number of benchmark results, platforms like Hugging Face provide leaderboards that compile scores from multiple tests and rank models accordingly. The Artificial Analysis Intelligence Index is another prominent one, aggregating results from eight advanced benchmarks -- including MMLU-Pro, Humanity's Last Exam and AIME -- to produce an overall score.
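Artificial Analysis does not spell out the weighting behind its composite number in public-facing summaries, but conceptually the index boils several test scores down to one figure. A minimal sketch, assuming an equal-weight average and using made-up scores:

```python
# Sketch of a composite intelligence index. The equal-weight average and
# the scores below are assumptions for illustration only; the actual
# Artificial Analysis methodology is not reproduced here.

benchmark_scores = {
    "MMLU-Pro": 81.8,              # hypothetical per-benchmark results
    "Humanity's Last Exam": 11.2,
    "AIME": 85.3,
    # ... the real index aggregates eight advanced benchmarks
}

def intelligence_index(scores: dict[str, float]) -> float:
    """Equal-weight mean of 0-100 benchmark scores (assumed, not official)."""
    return sum(scores.values()) / len(scores)

print(f"Composite index: {intelligence_index(benchmark_scores):.1f}")
```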
With strong scores across multiple benchmarks, LG's Exaone and Upstage's Solar Pro 2 were the only Korean LLMs to make the Artificial Analysis index in July.
At the time of release, Exaone 4.0 ranked 11th globally in the Intelligence Index, standing shoulder to shoulder with big brands such as Google's Gemini, OpenAI's ChatGPT and Alibaba's Qwen.
Upstage's Solar Pro 2 went a step further, becoming the only Korean model recognized in the leaderboard's Frontier Language Model Intelligence category -- reserved for the highest-performing systems at the cutting edge of research and development. It also topped the Intelligence vs. Cost to Run metric.
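That cost-efficiency metric rewards models that deliver intelligence cheaply, not just intelligence outright. The toy comparison below uses invented figures and a simple ratio of our own devising -- Artificial Analysis charts intelligence against price rather than publishing a single number -- but it shows why a mid-sized model can beat a far larger one on this axis.

```python
# Toy illustration of intelligence-per-cost. All figures are invented and
# the ratio is an assumption, not the platform's actual formula.

models = {
    # name: (intelligence score, blended price in USD per 1M tokens)
    "large_frontier_model": (68.0, 30.00),
    "mid_sized_model": (60.0, 1.50),
}

for name, (score, price) in models.items():
    print(f"{name}: {score / price:.1f} intelligence points per dollar")

# A slightly lower score at a fraction of the price wins on cost-efficiency.
```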
"It is fair to say Korean models are quite competitive, considering their rivals are often several times larger," an LG official said, noting that Grok 4, which held the top spot in the July index, has a staggering 1.7 trillion parameters -- meaning it consumed far more resources in training to achieve its intelligence score.
The index has since been refreshed with more challenging tests and newly released models such as GPT-5 -- which overtook Grok 4 for the top spot -- nudging the Korean models down slightly, though both remain on the global list.
LG AI Research and Upstage have both been named among the government's five consortia tasked with leading the development of South Korea's proprietary AI foundation models, alongside Naver Cloud, SK Telecom and NC AI.
Naver, which became the third company in the world to develop a hyperscale AI model with HyperClova in 2021, has since upgraded its foundation model and in June released HyperClova X Think. The company cites the model's deep understanding of the Korean language as its key strength.
Going beyond benchmarks
The way benchmarks gain recognition is similar to how a new measurement scale in the social sciences becomes a standard: a benchmark is published in a peer-reviewed paper, validated at a reputable academic conference and then adopted by the global AI community, an industry official explained.
However crowded the AI field becomes, with one LLM after another touting new benchmark scores, the results still serve an important purpose: they give engineers a yardstick for measuring their progress.
"Global big techs still lead, but players in countries like China, France and Korea are closing in, and the race is intense," an LG official said. "The presence of Korean companies on leaderboards and key benchmarks shows the country is not only catching up but is firmly in the game."
At the same time, the rollout of GPT-5 shows that real-world user experience matters as much as strong performance on advanced benchmark tests. Launched on August 7, the highly anticipated OpenAI model shot to the top of the Artificial Analysis Intelligence Index, but has faced backlash from users who claim it feels "downgraded," citing a blander personality and surprisingly basic mistakes.
Lee Kyoung-jun, a big data analytics professor at Kyung Hee University, stressed that the true measure of an LLM's competitiveness lies in its practical utility.
"Korean LLMs are making strides in benchmarks, but it's important to note that even major models like Exaone are having little impact on the general public for now," Lee said. "Efforts must continue to ensure these excellent models are adopted in real use cases and achieve widespread adoption."
herim@