
Latest news with #DeepSeekR1

DeepSeek says its R1 update rivals ChatGPT o3 and Gemini 2.5 Pro in performing math, coding and logic

India Today

4 days ago



Earlier this year, DeepSeek surprised the world with the launch of its R1 model, which rivaled – or at least came close to – much larger AI models developed in the US, despite being built by a Chinese startup at a fraction of the cost of models like ChatGPT and Gemini. R1 has now been upgraded, and DeepSeek says it is much better at reasoning, math and logic.

'In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimisation mechanisms during post-training,' DeepSeek wrote in a post on Hugging Face. DeepSeek says the model showed 'outstanding performance' in 'mathematics, programming, and general logic'. The company claims that after the update, the general performance of R1 is 'approaching that of leading models, such as O3 and Gemini 2.5 Pro.' 'Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks,' DeepSeek adds in its post.

DeepSeek also says that besides being better at problem solving and reasoning, the upgraded R1, or R1-0528, hallucinates less. The model now also reportedly offers a 'better experience for vibe coding'. However, a developer on X alleges that the latest DeepSeek model is significantly more restricted when it comes to sensitive free-speech issues, calling it the most heavily censored version so far, particularly when it comes to criticism of the Chinese government. '...the model is also the most censored Deepseek model yet for criticism of the Chinese government', the developer wrote in a post. This was first reported by TechCrunch.
The developer says that the new DeepSeek R1 model avoids giving direct answers to questions about sensitive subjects such as the internment camps in China's Xinjiang region, where over a million Uyghur Muslims have reportedly been detained. Although the model occasionally references Xinjiang as a human rights concern, the developer notes that it frequently echoes the Chinese government's official position when responding to related queries. 'Deepseek deserves criticism for this release: this model is a big step backwards for free speech,' he writes in a post on X. The developer reportedly conducted a test on a website called SpeechMap (which he has developed), where one can compare how different models treat sensitive and controversial subjects.

New DeepSeek R1 Coding Performance Tested : Pros, Cons and Real-World Applications

Geeky Gadgets

4 days ago



What if artificial intelligence could not only write code but also think through problems like a seasoned developer? Enter DeepSeek R1, the latest breakthrough in AI-driven coding and creativity. Built on the V3 architecture, this model promises to transform how we approach complex programming tasks, offering notable accuracy and adaptability. Yet even the most advanced technologies come with trade-offs. While DeepSeek R1 excels at generating intricate web applications and dynamic animations, its tendency to overanalyze simple problems raises questions about its efficiency in high-pressure scenarios. Is this the future of coding, or does its brilliance come at a cost?

In this in-depth breakdown, Prompt Engineering explores how DeepSeek R1 is redefining the boundaries of AI in coding and beyond. From its chain-of-thought reasoning to its ability to craft visually striking outputs, the model is a strong option for developers and creative professionals alike. However, we'll also uncover its limitations, such as its struggles with logical deduction and occasional inefficiencies. Whether you're curious about its competitive edge against models like Gemini 2.5 or eager to understand its potential for creative problem-solving, this analysis provides a balanced look at what makes DeepSeek R1 both impressive and imperfect. How does it stack up against the challenges of real-world applications? Let's find out.

Transforming Coding: DeepSeek R1's Unparalleled Performance

DeepSeek R1 sets a new standard in coding, showcasing performance that distinguishes it from earlier models. Whether you're developing interactive web applications, crafting animations, or designing complex algorithms, the model demonstrates strong accuracy and efficiency. Its performance in live coding benchmarks rivals leading competitors like Gemini 2.5 and Claude 3.7, cementing its status as a formidable player in the AI landscape.
  • Generates interactive web applications with minimal input, streamlining development workflows.
  • Excels in creative coding, such as futuristic interface design and dynamic animations.
  • Adapts seamlessly to real-time coding scenarios, enhancing productivity.

Despite these strengths, the model occasionally takes excessive processing time for straightforward tasks. This inefficiency could pose challenges in time-sensitive applications, highlighting an area for potential refinement.

Enhanced Reasoning: Transparency with Room for Growth

One of DeepSeek R1's standout features is its chain-of-thought reasoning. The model provides detailed, step-by-step explanations of its processes, allowing users to follow its logic with ease. This transparency is particularly valuable for debugging and understanding complex outputs, making it a useful tool for developers and analysts alike.

  • Delivers structured reasoning paths that enhance clarity and comprehension.
  • Maintains raw chain-of-thought visibility, ensuring transparency in decision-making.
  • Occasionally overanalyzes simple queries, leading to inefficiencies in certain scenarios.

While this capability is a major strength, the model's tendency to overthink can slow performance in situations requiring quick, straightforward solutions. Addressing this issue could further optimize its utility in diverse applications.

Creative Potential: Unlocking New Possibilities

Creativity is another domain where DeepSeek R1 excels.
The model is capable of generating visually compelling outputs, ranging from animations to themed designs and interactive constellations. These features make it a valuable asset for creative professionals seeking innovative solutions to complex challenges.

  • Produces intricate, aesthetically pleasing visual outputs that meet professional standards.
  • Demonstrates creativity in designing unique applications, interfaces, and artistic projects.
  • Supports imaginative problem-solving, making it a versatile tool across industries.

This creative versatility positions DeepSeek R1 as a valuable resource in fields such as entertainment, education, and digital design. However, ensuring consistency in its creative outputs remains an area for ongoing development.

Logical Deduction: Strengths and Challenges

DeepSeek R1 showcases robust reasoning capabilities but occasionally struggles with logical deduction. In some cases, it defaults to patterns derived from its training data rather than applying strict logical constraints to solve problems. This limitation underscores an area for improvement in future iterations.

  • Demonstrates inconsistent performance in tasks requiring rigorous logical reasoning.
  • Relies on training-data patterns in certain scenarios, which can limit its adaptability.
  • Opportunities for refinement exist to enhance its logical deduction capabilities.

Addressing these challenges will be critical for improving the model's reliability and effectiveness, particularly in applications requiring precise logical reasoning.

Processing Efficiency and User Interface Advancements

Built on the V3 architecture, DeepSeek R1 introduces significant advancements in processing efficiency and user interface (UI) generation.
The model supports both reasoning and non-reasoning modes, allowing users to tailor its behavior to their specific needs. However, its tendency to overthink can sometimes offset these efficiency gains.

  • Improved processing efficiency compared to earlier versions, allowing faster task completion.
  • Enhanced UI generation capabilities for seamless and intuitive user experiences.
  • Customizable modes that cater to diverse applications and user preferences.

These improvements make DeepSeek R1 a versatile tool for a wide range of users. However, further optimization is necessary to fully address its overthinking tendencies and maximize its potential.

Competitive Edge: Benchmarks and Comparisons

In coding benchmarks, DeepSeek R1 consistently delivers strong performance, often surpassing models like Gemini 2.5 in specific tasks. Its capabilities are comparable to Claude 3.7 in many scenarios, solidifying its position as a competitive option in the AI landscape.

  • Excels in coding and creative benchmarks, demonstrating superior performance in targeted tasks.
  • Outperforms some competitors in areas such as real-time coding and creative output generation.
  • Comparable to leading models in reasoning and problem-solving capabilities.

While official metrics from DeepSeek are still pending, early results suggest that R1 is a formidable player in the field. Its ability to compete with and, in some cases, outperform established models highlights its potential as a leading AI solution.

Future Prospects: Evolving the DeepSeek Series

The future of the DeepSeek series holds significant promise, with speculation suggesting that the upcoming R2 model may introduce a new architecture.
This evolution could build on the strengths of V3 while addressing its current limitations. Anticipated updates and features are expected to further enhance the model's capabilities.

  • Potential for a new architecture that improves reasoning and efficiency.
  • Focus on addressing current challenges, such as overthinking and logical inconsistencies.
  • Opportunities for enhanced customization and user control in future iterations.

These developments underscore the ongoing innovation within the DeepSeek series and its commitment to advancing the boundaries of artificial intelligence. As the series evolves, it is poised to become an even more powerful tool for professionals across various industries.

Media Credit: Prompt Engineering
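The raw chain-of-thought visibility discussed in this article is straightforward to work with programmatically: DeepSeek R1's open-weights releases emit the model's reasoning inside `<think>...</think>` delimiters ahead of the final answer. Below is a minimal sketch of separating the two; the helper name and the toy completion are illustrative, not from the article:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final_answer).

    Assumes the chain of thought is wrapped in <think>...</think>,
    as in DeepSeek R1's open-weights chat format. If no tags are
    present, the whole text is treated as the final answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer

# Example with a toy completion string:
raw = "<think>2 + 2 is 4, so the even choice is correct.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
```

Keeping the reasoning and the answer separate makes it easy to log or display the chain of thought for debugging while showing users only the final response.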

3 Breakthrough Ways Data Is Powering The AI Reasoning Revolution

Forbes

5 days ago



Olga Megorskaya is Founder & CEO of Toloka AI, a high-quality data partner for all stages of AI development.

The buzz around reasoning models like DeepSeek R1, OpenAI o1 and Grok 3 signals a turning point in AI development that pivots on reasoning. When we talk about reasoning, we mean that models can do more than repeat patterns—they think through problems step by step, consider multiple perspectives before giving a final answer and double-check their work. As reasoning skills improve, modern LLMs are pushing us closer to a future where AI agents can autonomously handle all sorts of tasks. AI agents will become useful enough for widespread use when they learn to truly reason, meaning they adapt to new challenges, generalize skills from one area to apply them in a new domain, navigate multiple environments and reliably produce correct answers and outputs.

Behind these emerging skills, you'll find sophisticated datasets used for training and evaluating the models. The better the data, the stronger the reasoning skills. How is data shaping the next generation of reasoning models and agents? As a data partner to frontier labs, we've identified three ways that data drives AI reasoning right now: domain diversity and complexity, refined reasoning and robust evaluations. By building stronger reasoning skills in AI systems, these new approaches to data for training and testing will open a door to the widespread adoption of AI agents.

Current models often train well in structured environments like math and coding, where answer verification is straightforward, fitting nicely into classical reinforcement learning frameworks. But the next leap requires pushing into more complex data across a wider knowledge spectrum. This is to achieve better generalization and performance as models transfer learning across areas.
Beyond math and coding, here's the kind of data becoming essential for training the next wave of AI. These data points cover multi-step scenarios like web research trajectories with verification checkpoints. They include open-ended domains such as law or business consulting that have multifaceted answers, which makes them difficult to verify but important for advanced reasoning. Think of complex legal issues with multiple valid approaches, or comprehensive market assessments with validation criteria.

Agent datasets are based on taxonomies of use cases, domains and categories as well as real-world tasks. For instance, a task for a corporate assistant agent would be to respond to a support request using simulated knowledge bases and company policies. Agents also need contexts and environments that simulate how they interact with specific software, data in a CRM or knowledge base, or other infrastructure. These contexts are created manually for agent training and testing.

The path a model takes to an answer is becoming as critical as the answer itself. As classical model training approaches are revisited, techniques like reward shaping (providing intermediate guidance) are vital. Current methods focus on guiding the process with feedback from human experts for better coherence, efficiency and safety.

One approach focuses on a model's "thinking" rather than the outcome, guiding it through logical reasoning steps or guiding an agent through interactions with the environment. Think of it like checking step-by-step proofs in math, where human experts review each step and identify where a model makes a mistake instead of evaluating the final answer.

Preference-based learning trains models to prioritize better reasoning paths. Experts review alternative paths and choose the best ones for models to learn from. This data can compare entire trajectories or individual steps in a process.
A third approach uses data crafted from scratch to show high-quality reasoning sequences, much like teaching by example. Another option is to edit LLM reasoning steps to improve them and let the model learn from the corrections.

Current LLM evaluations have two main limitations: they struggle to provide meaningful signals of substantial improvements, and they are slow to adapt. The challenges mirror those in training data, including limited coverage of niche domains and specialized skills. To drive real progress, benchmarks need to specifically address the quality and safety of reasoning models and agents. Based on our own efforts, here's how to collaborate with clients on evaluations:

  • Include a wider range of domains, specialized skill sets and more complex, real-world tasks.
  • Move beyond single-metric evaluations to assess interdisciplinary and long-term challenges like forecasting.
  • Use fine-grained, use-case-specific metrics, co-developed with subject-matter experts to add depth and capture nuances that standard benchmarks miss.

As models develop advanced reasoning, safety evaluations must track the full chain of thought. For agents interacting with external tools or APIs, red teaming becomes critical. We recommend developing structured testing environments for red teamers and using the outcomes to generate new datasets focused on identified vulnerabilities.

Even as model architectures advance, data remains the bedrock. In the era of reasoning models and agents, the emphasis has shifted decisively toward data quality, diversity and complexity. New approaches to data production are having a tremendous impact on the pace of AI development, urging reasoning models forward faster. With data providers upping their game to support the reasoning paradigm, we expect the near future to bring a wave of domain-specific, task-optimized reasoning agents—a new era of agentic AI.
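Pairwise preference data of the kind described in this article is commonly reduced to a Bradley-Terry objective, where a reward model is trained so the expert-preferred trajectory scores higher. The article does not specify any particular training recipe; this is a generic sketch of the standard pairwise loss, with hand-picked scores standing in for a learned reward model's outputs:

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen
    trajectory outranks the rejected one:
        -log sigmoid(score_chosen - score_rejected)
    Minimizing this pushes a reward model to score the
    expert-preferred reasoning path higher than the alternative."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen path yields a smaller loss:
low = preference_loss(2.0, 0.0)   # chosen path scored clearly higher
high = preference_loss(0.0, 2.0)  # chosen path scored lower (penalized)
```

The same loss applies whether the comparison covers entire trajectories or individual reasoning steps; only what the scores are computed over changes.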
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

China's DeepSeek quietly releases upgraded R1 AI model, ramping up competition with OpenAI

CNBC

5 days ago



Chinese startup DeepSeek, which caused shockwaves across markets this year, quietly released an upgraded version of its artificial intelligence reasoning model. The company did not make an official announcement, but the upgrade of DeepSeek R1 was released on AI model repository Hugging Face.

DeepSeek rose to prominence this year after its free, open-source R1 reasoning model outperformed offerings from rivals including Meta and OpenAI. Its low cost and short development time shocked global markets, sparking concerns that U.S. tech giants were overspending on infrastructure and wiping billions of dollars of value off major U.S. tech stocks like AI stalwart Nvidia. These companies have since broadly recovered.

Just as was the case with DeepSeek R1's debut, the upgraded model was released with little fanfare. It is a reasoning model, which means the AI can execute more complicated tasks through a step-by-step logical thought process. The upgraded DeepSeek R1 model is just behind OpenAI's o4-mini and o3 reasoning models on LiveCodeBench, a site that benchmarks models against different metrics.

DeepSeek has become the poster child of how Chinese artificial intelligence is still developing despite U.S. attempts to restrict the country's access to chips and other technology. This month, Chinese technology giants Baidu and Tencent revealed how they were making their AI models more efficient to deal with U.S. semiconductor export curbs.

Jensen Huang, CEO of Nvidia, which designs the graphics processing units required to train huge AI models, slammed U.S. export controls on Wednesday. "The U.S. has based its policy on the assumption that China cannot make AI chips," Huang said. "That assumption was always questionable, and now it's clearly wrong." "The question is not whether China will have AI," Huang added. "It already does."
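Leaderboards like LiveCodeBench typically score models with pass@k metrics over repeated samples per problem. A standard unbiased estimator for pass@k, given n sampled completions of which c pass the tests, comes from OpenAI's HumanEval evaluation methodology (shown here as a general sketch, not LiveCodeBench's exact pipeline):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least
    one of k samples drawn without replacement from n completions,
    of which c are correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 2 pass, pass@2 is 1 - C(2,2)/C(4,2) = 5/6: only one of the six possible pairs contains no passing sample.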

DeepSeek Unveils Update to R1 Model

Yahoo

5 days ago



(Bloomberg) -- DeepSeek said it has upgraded the R1 artificial intelligence model that helped propel the Chinese startup to global prominence at the start of this year.

The company completed what it described as a 'minor trial upgrade' and is allowing users to start testing it, it said in an official WeChat group on Wednesday. Details of the upgrade weren't provided and the company didn't respond to an email seeking further comment.

The Hangzhou-based startup stunned the global tech industry in January when it unveiled the original R1, a reasoning AI model that outperformed Western players on several standardized metrics, purportedly at a cost of just several million dollars. It triggered a reconsideration of heavy investments in acquiring AI computational resources and a flurry of new model introductions from Chinese players from Alibaba Group Holding Ltd. to Zhipu AI.

'The fast pace of model releases and updates since the release of DeepSeek R1 has resulted in some 'model fatigue' among investors,' said Gary Tan, portfolio manager at Allspring Global Investments. 'Until there is a breakthrough in the model, investors are turning their focus on which internet companies can integrate AI into their operations and create a killer application.'

The debut of R1 turned DeepSeek founder Liang Wenfeng into a tech celebrity and a symbol of China's ability to compete with the best of Silicon Valley. In February, President Xi Jinping invited Liang to a high-profile gathering with some of the country's most prominent entrepreneurs. The young founder was seated among the likes of Alibaba co-founder Jack Ma and Tencent Holdings Ltd.'s Pony Ma.
DeepSeek's upgrade was announced hours before the latest financial report from Nvidia Corp., the leading maker of AI chips, whose shares were pummeled in the immediate wake of R1's release. Nvidia's fortunes have recovered since, as AI data center investment has continued at a strong pace, and the US company offered a solid forecast for the current quarter.

--With assistance from Jessica Sui, Winnie Hsu and Saritha Rai.

©2025 Bloomberg L.P.
