Latest news with #AIME2024


Indian Express
2 days ago
- Business
- Indian Express
OpenAI rolls out o3-Pro, its most capable reasoning model yet
AI powerhouse OpenAI has introduced a major upgrade with a massive price cut. The company has launched o3-pro, a significant leap in reasoning technology. The model can be accessed through ChatGPT Pro and Team subscriptions, and the company plans to roll out Enterprise access next week. It is also available through OpenAI's developer API. The o3-pro model has been designed to handle complex tasks, with use cases spanning fields like technology and education.

Another notable aspect of o3-pro is the pricing. The model costs $20 per million input tokens and $80 per million output tokens, nearly 87 per cent cheaper than o1-pro. The base o3 model's price has also dropped by 80 per cent, to $2/$8 per million tokens.

The new model comes with enhanced reasoning. Reportedly, expert evaluators consistently preferred o3-pro over the regular o3 model across categories, and the model has shown remarkable performance in programming, science, and even business tasks. Compared with previous reasoning models, o3-pro can search the web, analyse files, run Python code, remember conversations, and more.

"In expert evaluations, reviewers consistently prefer OpenAI o3-pro over o3, highlighting its improved performance in key domains—including science, education, programming, data analysis, and writing. Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness,…" — OpenAI (@OpenAI) June 10, 2025

According to the company, reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy. Like o1-pro, o3-pro excels at math, science, and coding in academic evaluations.

The new o3-pro is available in the model picker for Pro and Team users, replacing o1-pro; the company has said that Enterprise and Edu users will get access the week after. Because o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.

The o3-pro model also comes with some limitations: it cannot generate images, does not support Canvas (OpenAI's AI workspace), and does not offer temporary chats. Regardless, the model performs well in internal benchmarks. On AIME 2024, which evaluates math skills, it outperformed Google's Gemini 2.5 Pro, and it also surpassed Anthropic's Claude 4 Opus on GPQA Diamond, a benchmark for PhD-level science knowledge.


The Star
3 days ago
- Business
- The Star
OpenAI releases AI reasoning model o3-pro
SAN FRANCISCO, June 10 (Xinhua) -- OpenAI on Tuesday announced the launch of o3-pro, the company's most advanced reasoning artificial intelligence (AI) model to date. O3-pro is a version of OpenAI's o3, a reasoning model first introduced earlier this year. Reasoning models solve problems step by step, making them more reliable in fields such as physics, mathematics, and computer programming.

The o3-pro model is available starting Tuesday for ChatGPT Pro and Team users, replacing the previous o1-pro model. Enterprise and Edu users will gain access the following week, according to OpenAI. O3-pro is also now live in OpenAI's developer API as of Tuesday afternoon. The model is priced at 20 U.S. dollars per million input tokens and 80 dollars per million output tokens in the API. One million input tokens are approximately equivalent to 750,000 words.

"In expert evaluations, reviewers consistently preferred o3-pro over o3 in every tested category -- especially in key areas like science, education, programming, business, and writing assistance," OpenAI stated. "Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy."

O3-pro has access to a range of tools, including web browsing, file analysis, visual reasoning, Python execution, and personalized memory-based responses, according to the company. In internal testing, o3-pro achieved impressive results in widely used AI benchmarks. On AIME 2024, which assesses mathematical ability, o3-pro outperformed Google's Gemini 2.5 Pro. It also surpassed Anthropic's Claude 4 Opus on GPQA Diamond, a benchmark for PhD-level science knowledge, the company reported.
Yahoo
04-04-2025
- Business
- Yahoo
Stop chasing AI benchmarks—create your own
Every few months, a new large language model (LLM) is anointed AI champion, with record-breaking benchmark scores. But these celebrated metrics of LLM performance—such as tests of graduate-level reasoning and abstract math—rarely reflect real business needs or represent truly novel AI frontiers. For companies in the market for enterprise AI models, basing the decision of which models to use on these leaderboards alone can lead to costly mistakes—from wasted budgets to misaligned capabilities and potentially harmful, domain-specific errors that benchmark scores rarely capture.

Public benchmarks can be helpful to individual users by providing directional indicators of AI capabilities. And admittedly, some code-completion and software-engineering benchmarks, like SWE-Bench or Codeforces, are valuable for companies within a narrow range of coding-related, LLM-based business applications. But the most common benchmarks and public leaderboards often distract both businesses and model developers, pushing innovation toward marginal improvements in areas unhelpful for businesses or unrelated to areas of breakthrough AI innovation. The challenge for executives, therefore, lies in designing business-specific evaluation frameworks that test potential models in the environments where they'll actually be deployed. To do that, companies will need to adopt tailored evaluation strategies that run at scale on relevant and realistic data.

The flashy benchmarks that model developers tout in their releases are often detached from the realities of enterprise applications. Consider some of the most popular ones: graduate-level reasoning (GPQA Diamond) and high school-level math tests, like MATH-500 and AIME 2024. Each of these was cited in the release announcements for GPT o1, Sonnet 3.7, or DeepSeek's R1. But none of these indicators is helpful in assessing common enterprise applications like knowledge management tools, design assistants, or customer-facing chatbots.

Instead of assuming that the "best" model on a given leaderboard is the obvious choice, businesses should use metrics tailored to their specific needs to work backward and identify the right model. Start by testing models on your actual context and data—real customer queries, domain-specific documents, or whatever inputs your system will encounter in production; a minimal sketch of such a harness appears below. When real data is scarce or sensitive, companies can craft synthetic test cases that capture the same challenges. Without real-world tests, companies can end up with ill-fitting models that may, for instance, require too much memory for edge devices, have latency that is too high for real-time interactions, or lack support for the on-premises deployment sometimes mandated by data governance standards.

Salesforce has tried to bridge this gap between common benchmarks and its actual business requirements by developing its own internal benchmark for its CRM-related needs. The company created evaluation criteria specifically for tasks like prospecting, nurturing leads, and generating service case summaries—the actual work that marketing and sales teams need AI to perform.

Popular benchmarks are not only insufficient for informed business decision-making but can also be misleading. LLM media coverage, including all three major recent release announcements, often uses benchmarks to compare models based on their average performance, distilling each benchmark into a single dot, number, or bar.
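As a concrete illustration of the "test on your own data" advice above, here is a minimal, hypothetical evaluation-harness sketch in Python. Every name in it (the EvalCase structure, the call_model stub, the keyword-coverage scoring, the candidate model names) is an assumption for illustration, not a real vendor API; a real harness would call actual providers and use domain-appropriate checks.

```python
# Minimal sketch of a business-specific evaluation harness.
# All names are illustrative: call_model() stands in for whatever provider
# client you use, and the test cases would come from your own domain data.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str               # a real or synthetic customer query / document task
    must_contain: list[str]   # domain facts the answer is expected to mention

# Hypothetical domain-specific cases; in practice, sample from production logs,
# or have domain experts write synthetic cases when real data is sensitive.
CASES = [
    EvalCase("What is the return window for opened electronics?", ["30 days"]),
    EvalCase("Summarize the warranty terms for model X-200.", ["two-year", "battery"]),
]

def call_model(model_name: str, prompt: str) -> str:
    # Placeholder: swap in a real API or local-model call here.
    # A canned answer keeps this sketch self-contained and runnable.
    return "Opened electronics can be returned within 30 days of purchase."

def score_case(answer: str, case: EvalCase) -> float:
    # Crude keyword-coverage score; replace with checks that match your domain.
    hits = sum(term.lower() in answer.lower() for term in case.must_contain)
    return hits / len(case.must_contain)

def evaluate(model_name: str) -> float:
    scores = [score_case(call_model(model_name, c.prompt), c) for c in CASES]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    for model in ["candidate-a", "candidate-b"]:  # models under consideration
        print(model, round(evaluate(model), 3))
```

The point of the sketch is its shape rather than the scoring rule: the test set is drawn from the business's own workload, and the comparison is between candidate models on that workload rather than on a public leaderboard.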
The trouble with such single-number comparisons is that generative AI models are stochastic, highly input-sensitive systems, which means that slight variations of a prompt can make them behave unpredictably. A recent research paper from Anthropic rightly argues that, as a result, single dots on a performance comparison chart are insufficient because of the large error ranges of the evaluation metrics. A recent study by Microsoft found that using a statistically more accurate cluster-based evaluation on the same benchmarks can significantly change the rank ordering of—and the public narratives about—models on a leaderboard.

That's why business leaders need to ensure reliable measurements of model performance across a reasonable range of variations, done at scale, even if it requires hundreds of test runs. This thoroughness becomes even more critical when multiple systems are combined through AI and data supply chains, potentially increasing variability. For industries like aviation or healthcare, the margin of error is small and far beyond what current AI benchmarks typically guarantee, so relying solely on leaderboard metrics can obscure substantial operational risk in real-world deployments.

Businesses must also test models in adversarial scenarios to ensure the security and robustness of a model—such as a chatbot's resistance to manipulation by bad actors attempting to bypass guardrails—which cannot be measured by conventional benchmarks. LLMs are notably vulnerable to being fooled by sophisticated prompting techniques. Depending on the use case, implementing strong safeguards against these vulnerabilities could determine your technology choice and deployment strategy. The resilience of a model in the face of a potential bad actor could be a more important metric than the model's math or reasoning capabilities. In our view, making AI 'foolproof' is an exciting and impactful next barrier for AI researchers to break, one that may require novel model development and testing techniques.

Start with existing evaluation frameworks. Companies should begin by leveraging the strengths of existing automated tools (along with human judgment and practical but repeatable measurement goals). Specialized AI evaluation toolkits, such as DeepEval, LangSmith, TruLens, Mastra, or ARTKIT, can expedite and simplify testing, allowing for consistent comparison across models and over time.

Bring human experts to the testing ground. Effective AI evaluation requires that automated testing be supplemented with human judgment wherever possible. Automated evaluation could include a comparison of LLM answers to ground-truth answers, or the use of proxy metrics, such as automated ROUGE or BLEU scores, to gauge the quality of text summarization. For nuanced assessments where machines still struggle, however, human evaluation remains vital. This could include domain experts or end-users conducting a 'blind' review of a sample of model outputs. Such reviews can also flag potential biases in responses, such as LLMs giving responses about job candidates that are biased by gender or race. This human layer of review is labor-intensive, but it can provide additional critical insight, like whether a response is actually useful and well-presented.
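To make the "hundreds of test runs" point above concrete, here is a small Python sketch that reports an error range instead of a single dot. The run_eval function is a hypothetical placeholder that would perform one full pass over your own evaluation set (with sampling enabled) and return a single score; the normal-approximation confidence interval used here is one simple choice among several.

```python
# Sketch: score a model over repeated evaluation runs and report a mean with a
# ~95% confidence interval, rather than a single point estimate.
import random
import statistics

def run_eval(model_name: str, seed: int) -> float:
    # Hypothetical placeholder for one full pass over your evaluation set.
    # Here it just simulates run-to-run noise so the sketch is self-contained.
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.05, 0.05)

def score_with_interval(model_name: str, runs: int = 100) -> tuple[float, float]:
    scores = [run_eval(model_name, seed) for seed in range(runs)]
    mean = statistics.mean(scores)
    # Normal approximation of the uncertainty in the mean across runs.
    half_width = 1.96 * statistics.stdev(scores) / (runs ** 0.5)
    return mean, half_width

if __name__ == "__main__":
    mean, hw = score_with_interval("candidate-model")
    print(f"score = {mean:.3f} ± {hw:.3f} (95% CI over 100 runs)")
```

When two models' intervals overlap substantially, a leaderboard-style claim that one "beats" the other is not supported by the measurement, which is exactly the failure mode the error-range analyses cited above are designed to expose.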
The value of this hybrid approach can be seen in a recent case study in which a company evaluated an HR-support chatbot using both human and automated tests. The company's iterative internal evaluation process with human involvement showed that a significant source of LLM response errors was flawed updates to enterprise data. The discovery highlights how human evaluation can uncover systemic issues beyond the model itself.

Focus on tradeoffs, not isolated dimensions of assessment. When evaluating models, companies must look beyond accuracy to consider the full spectrum of business requirements: speed, cost efficiency, operational feasibility, flexibility, maintainability, and regulatory compliance. A model that performs marginally better on accuracy metrics might be prohibitively expensive or too slow for real-time applications. A great example of this is how OpenAI's GPT o1 (a leader in many benchmarks at release time) performed when applied to the ARC-AGI prize. To the surprise of many, the o1 model performed poorly, largely due to ARC-AGI's 'efficiency limit' on the computing power used to solve the benchmark tasks. The o1 model would often take too long, using more compute time to try to come up with a more accurate answer. Most popular benchmarks have no time limit, even though time would be a critically important factor for many business use cases.

Tradeoffs become even more important in the growing world of (multi-)agentic applications, where simpler tasks can be handled by cheaper, quicker models (overseen by an orchestration agent), while the most complex steps (such as solving the broken-out series of problems from a customer) could require a more powerful reasoning model to succeed. Microsoft Research's HuggingGPT, for example, orchestrates specialized models for different tasks under a central language model. Being prepared to change models for different tasks requires building flexible tooling that isn't hard-coded to a single model or provider; a toy routing sketch appears at the end of this piece. This built-in flexibility allows companies to easily pivot and change models based on evaluation results. While this may sound like a lot of extra development work, a number of available tools, like LangChain, LlamaIndex, and Pydantic AI, can simplify the process.

Turn model testing into a culture of continuous evaluation and monitoring. As technology evolves, ongoing assessment ensures AI solutions remain optimal while staying aligned with business objectives. Much like how software engineering teams implement continuous integration and regression testing to catch bugs and prevent performance degradation in traditional code, AI systems require regular evaluation against business-specific benchmarks. Similar to the practice of pharmacovigilance among users of new medicines, feedback from LLM users and affected stakeholders also needs to be continuously gathered and analyzed to ensure AI 'behaves as expected' and doesn't drift from its intended performance targets. This kind of bespoke evaluation framework fosters a culture of experimentation and data-driven decision-making. It also enforces a new and critical mantra: AI may be used for execution, but humans are in control and must govern AI.

For business leaders, the path to AI success lies not in chasing the latest benchmark champions but in developing evaluation frameworks for your specific business objectives. Think of this approach as 'a leaderboard for every user,' as one Stanford paper suggests.
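As a sketch of the flexible, provider-agnostic tooling described above, the following Python snippet routes each request to either a cheap model or a stronger reasoning model behind a common interface. The model stubs, the registry, and the word-count heuristic are all hypothetical choices for illustration; in practice the callables would wrap real provider clients (or a framework such as LangChain), and the routing decision might come from a classifier or an orchestration agent.

```python
# Toy sketch of provider-agnostic routing: simple requests go to a cheap model,
# complex ones to a reasoning model, and nothing downstream depends on a vendor SDK.
from typing import Callable, Dict

ModelFn = Callable[[str], str]  # any model is just "prompt in, text out"

def cheap_model(prompt: str) -> str:
    return f"[cheap model] answer to: {prompt[:40]}"       # stub; wrap a real client here

def reasoning_model(prompt: str) -> str:
    return f"[reasoning model] answer to: {prompt[:40]}"   # stub; wrap a real client here

REGISTRY: Dict[str, ModelFn] = {"cheap": cheap_model, "reasoning": reasoning_model}

def route(prompt: str) -> str:
    # Illustrative heuristic only; a production router could use a classifier,
    # cost/latency budgets, or an orchestration agent to pick the tier.
    needs_reasoning = len(prompt.split()) > 50 or "step by step" in prompt.lower()
    return REGISTRY["reasoning" if needs_reasoning else "cheap"](prompt)

if __name__ == "__main__":
    print(route("What are your opening hours?"))
    print(route("Work through this multi-part billing dispute step by step: ..."))
```

Because the rest of the application only ever sees the ModelFn interface, swapping a model after a fresh round of evaluation becomes a one-line change to the registry rather than a rewrite.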
The true value of AI deployment comes from three key actions: defining metrics that directly measure success in your business context; implementing statistically robust testing in realistic situations, using your actual data in your actual context; and fostering a culture of continuous monitoring, evaluation, and experimentation that draws on both automated tools and human expertise to assess tradeoffs across models. By following this approach, executives will be able to identify solutions optimized for their specific needs without paying premium prices for 'top-notch' models. Doing so can hopefully help steer the model development industry away from chasing marginal improvements on the same metrics—falling victim to Goodhart's law with capabilities of limited use for business—and instead free it up to explore new avenues of innovation and the next AI breakthrough.

Read other Fortune columns by François Candelon. François Candelon is a partner at private equity firm Seven2 and the former global director of the BCG Henderson Institute. Theodoros Evgeniou is a professor at INSEAD and a cofounder of the trust and safety company Tremau. Max Struever is a principal engineer at BCG-X and an ambassador at the BCG Henderson Institute. David Zuluaga Martínez is a partner at Boston Consulting Group and an ambassador at the BCG Henderson Institute. Some of the companies mentioned in this column are past or present clients of the authors' employers.


Boston Globe
27-01-2025
- Business
- Boston Globe
What is China's DeepSeek and why is it freaking out the AI world?
What exactly is DeepSeek? DeepSeek was founded in 2023 by Liang Wenfeng, the chief of AI-driven quant hedge fund High-Flyer. The company develops AI models that are open-source, meaning the developer community at large can inspect and improve the software. Its mobile app surged to the top of the iPhone download charts in the US after its release in early January. The app distinguishes itself from other chatbots like OpenAI's ChatGPT by articulating its reasoning before delivering a response to a prompt. The company claims its R1 release offers performance on par with OpenAI's latest, and it has licensed the technology so that anyone interested in developing chatbots can build on it.

How does DeepSeek R1 compare to OpenAI or Meta AI? Though not fully detailed by the company, the cost of training and developing DeepSeek's models appears to be only a fraction of what's required for OpenAI or Meta Platforms Inc.'s best products. The much better efficiency of the model calls into question the need for vast expenditures of capital to acquire the latest and most powerful AI accelerators from the likes of Nvidia Corp. It also amplifies attention on US export curbs of such advanced semiconductors to China — which were intended to prevent a breakthrough of the sort that DeepSeek appears to represent. DeepSeek says R1 is near or better than rival models in several leading benchmarks, such as AIME 2024 for mathematical tasks, MMLU for general knowledge and AlpacaEval 2.0 for question-and-answer performance. It also ranks among the top performers on a UC Berkeley-affiliated leaderboard called Chatbot Arena.

What's raising alarm in the US? Washington has banned the export of high-end technologies like GPU semiconductors to China, in a bid to stall the country's advances in AI, the key frontier in the US-China contest for tech supremacy. But DeepSeek's progress suggests Chinese AI engineers have worked their way around the restrictions, focusing on greater efficiency with limited resources. While it remains unclear how much advanced AI-training hardware DeepSeek has had access to, the company has demonstrated enough to suggest the trade restrictions have not been entirely effective in stymieing China's progress.

When did DeepSeek spark global interest? The AI developer has been closely watched since the release of its earliest model in 2023. Then in November, it gave the world a glimpse of its DeepSeek R1 reasoning model, designed to mimic human thinking. That model underpins its mobile chatbot app, which, together with the web interface, rocketed to global renown in January as a much cheaper OpenAI alternative, with investor Marc Andreessen calling it 'AI's Sputnik moment.' The DeepSeek mobile app was downloaded 1.6 million times by Jan. 25 and ranked No. 1 in iPhone app stores in Australia, Canada, China, Singapore, the US and the UK, according to data from market tracker App Figures.

Who is DeepSeek's founder? Born in Guangdong in 1985, Liang received bachelor's and master's degrees in electronic and information engineering from Zhejiang University. He founded DeepSeek with 10 million yuan ($1.4 million) in registered capital, according to company database Tianyancha. The bottleneck for further advances is not more fundraising, Liang said in an interview with Chinese outlet 36kr, but US restrictions on access to the best chips.
Most of his top researchers were fresh graduates from top Chinese universities, he said, stressing the need for China to develop its own domestic ecosystem akin to the one built around Nvidia and its AI chips. 'More investment does not necessarily lead to more innovation. Otherwise, large companies would take over all innovation,' Liang said.

Where does DeepSeek stand in China's AI landscape? China's technology leaders, from Alibaba Group Holding Ltd. and Baidu Inc. to Tencent Holdings Ltd., have poured significant money and resources into the race to acquire hardware and customers for their AI ventures. Alongside Kai-Fu Lee's startup, DeepSeek stands out with its open-source approach — designed to recruit the largest number of users quickly before developing monetization strategies atop that large audience. Because DeepSeek's models are more affordable, the company has already played a role in helping drive down costs for AI developers in China, where the bigger players have engaged in a price war that has seen successive waves of price cuts over the past year and a half.

What are the implications for the global AI marketplace? DeepSeek's success may push OpenAI and other US providers to lower their pricing to maintain their established lead. It also calls into question the vast spending by companies like Meta and Microsoft Corp. — each of which has committed to capex of $65 billion or more this year, largely on AI infrastructure — if more efficient models can compete with a much smaller outlay. The news roiled global stock markets as investors sold off companies like Nvidia Corp. and ASML Holding NV that have benefited from booming demand for AI services, while shares in Chinese names linked to DeepSeek, such as Iflytek Co., climbed. Already, developers around the world are experimenting with DeepSeek's software and looking to build tools with it. That could quicken the adoption of advanced AI reasoning models — while also potentially touching off additional concern about the need for guardrails around their use. DeepSeek's advances may also hasten regulation to control how AI is developed.

What are DeepSeek's shortcomings? Like all other Chinese AI models, DeepSeek self-censors on topics deemed sensitive in China. It deflects queries about the 1989 Tiananmen Square protests or geopolitically fraught questions such as the possibility of China invading Taiwan. In tests, the DeepSeek bot is capable of giving detailed responses about political figures like Indian Prime Minister Narendra Modi, but declines to do so about Chinese President Xi Jinping. DeepSeek's cloud infrastructure is also likely to be tested by its sudden popularity. The company briefly experienced a major outage on Jan. 27 and will have to manage even more traffic as new and returning users pour more queries into its chatbot.