New Grok 4 Takes on 'Humanity's Last Exam' as the AI Race Heats Up


Elon Musk released the newest artificial intelligence model from his company xAI on Wednesday night. In an hour-long public reveal session, he called the model, Grok 4, 'the smartest AI in the world' and claimed it was capable of getting perfect SAT scores and near-perfect GRE results in every subject, from the humanities to the sciences.
During the online launch, Musk and members of his team described testing Grok 4 on a metric called Humanity's Last Exam (HLE)—a 2,500-question benchmark designed to evaluate an AI's academic knowledge and reasoning skill. Created by nearly 1,000 human experts across more than 100 disciplines and released in January 2025, the test spans topics from the classics to quantum chemistry and mixes text with images. Grok 4 reportedly scored 25.4 percent on its own. But given access to tools (such as external aids for code execution or Web searches), it hit 38.6 percent. That jumped to 44.4 percent with a version called Grok 4 Heavy, which uses multiple AI agents to solve problems. The two next best-performing AI models are Google's Gemini-Pro (which achieved 26.9 percent with the tools) and OpenAI's o3 model (which got 24.9 percent, also with the tools). The results from xAI's internal testing have yet to appear on the leaderboard for HLE, however, and it remains unclear whether this is because xAI has yet to submit the results or because those results are pending review. Manifold, a social prediction market platform where users bet play money (called 'Mana') on future events in politics, technology and other subjects, predicted a 1 percent chance, as of Friday morning, that Grok 4 would debut on HLE's leaderboard with a 45 percent score or greater on the exam within a month of its release. (Meanwhile xAI has claimed a score of only 44.4.)
During the launch, the xAI team also ran live demonstrations showing Grok 4 crunching baseball odds, determining which xAI employee has the 'weirdest' profile picture on X and generating a simulated visualization of a black hole. Musk suggested that the system may discover entirely new technologies by later this year—and possibly 'new physics' by the end of next year. Games and movies are on the horizon, too, with Musk predicting that Grok 4 will be able to make playable titles and watchable films by 2026. Grok 4 also has new audio capabilities, including a voice that sang during the launch, and Musk said new image generation and coding tools are soon to be released. The regular version of Grok 4 costs $30 a month; SuperGrok Heavy—the deluxe package with multiple agents and research tools—runs at $300.
Artificial Analysis, an independent benchmarking platform that ranks AI models, now lists Grok 4 as highest on its Artificial Analysis Intelligence Index, slightly ahead of Gemini 2.5 Pro and OpenAI's o4-mini-high. And Grok 4 appears as the top-performing publicly available model on the leaderboards for the Abstraction and Reasoning Corpus, or ARC-AGI-1, and its second edition, ARC-AGI-2, both benchmarks that measure progress toward 'humanlike' general intelligence. Greg Kamradt, president of ARC Prize Foundation, a nonprofit organization that maintains the two leaderboards, says that when the xAI team contacted the foundation with Grok 4's results, the organization independently tested Grok 4 on a dataset to which the xAI team did not have access and confirmed the results. 'Before we report performance for any lab, it's not verified unless we verify it,' Kamradt says. 'We approved the [testing results] slide that [the xAI team] showed in the launch.'
According to xAI, Grok 4 also outstrips other AI systems on a number of additional benchmarks that suggest its strength in STEM subjects. Alex Olteanu, a senior data science editor at AI education platform DataCamp, has tested it. 'Grok has been strong on math and programming in my tests, and I've been impressed by the quality of its chain-of-thought reasoning, which shows an ingenious and logically sound approach to problem-solving,' Olteanu says. 'Its context window, however, isn't very competitive, and it may struggle with large code bases like those you encounter in production. It also fell short when I asked it to analyze a 170-page PDF, likely due to its limited context window and weak multimodal abilities.' (Multimodal abilities refer to a model's capacity to analyze more than one kind of data at the same time, such as a combination of text, images, audio and video.)
On a more nuanced front, issues with Grok 4 have surfaced since its release. Several posters on X (which is owned by Musk himself), as well as tech-industry news outlets, have reported that when Grok 4 was asked questions about the Israeli-Palestinian conflict, abortion and U.S. immigration law, it often searched for Musk's stance on these issues by referencing his X posts and articles written about him. And the release of Grok 4 comes after several controversies with Grok 3, the previous model, which issued outputs that included antisemitic comments, praise for Hitler and claims of 'white genocide'; xAI publicly acknowledged these incidents, attributing them to unauthorized manipulations and stating that the company was implementing corrective measures.
At one point during the launch, Musk commented on how making an AI smarter than humans is frightening, though he said he believes the ultimate result will be good—probably. 'I somewhat reconciled myself to the fact that, even if it wasn't going to be good, I'd at least like to be alive to see it happen,' he said.

Related Articles

Rivian Is Integrating Google Maps Into Its Native Navigation Software

Motor Trend


Fresh off major upgrades to its R1S SUV and R1T pickup, Rivian is now set to launch a software update of its onboard navigation system to include Google Maps data, both for the new vehicles it's producing and its existing customer cars. This isn't simply Google Maps like the app on your phone, however. Leveraging Google's Automotive SDK, Rivian has, in effect, overlaid its existing navigation experience over that of Google Maps, taking advantage of the internet giant's superior "routing, estimated time of arrivals, traffic updates, search capabilities, and satellite imagery," as Rivian put it, in order to further augment its own in-house "EV-friendly navigation features."

In other words, Rivian's navigation system will now blend Google's superior mapping capabilities with its proprietary charging information, including estimated range and battery capacity remaining at your destination, route planning, real-time charging info, and more. In addition, Rivian skins the whole interface in its own design, a new version of which will also debut with the Google Maps update. Even better, Rivian says the update will extend to its Rivian Mobile App (a 2025 MotorTrend Best Tech award winner), which benefits from Google-like photos and descriptions of searched destinations, as well as satellite map views and real-time traffic data. As before, users can send trips and navigation destinations from the app to their Rivians, and now they'll be able to use the "share" function via Google Maps to do the same.
The rollout of the new Google-augmented system is imminent: Rivian says it will begin as soon as tomorrow via an over-the-air update for every all-electric R1S SUV and R1T pickup it has sold to date, and the system will come preinstalled on every new model it sells.

The government wants AI to fight wars and review your taxes

Yahoo


Elon Musk has receded from Washington but one of his most disruptive ideas about government is surging inside the Trump administration. Artificial intelligence, Musk has said, can do a better job than federal employees at many tasks - a notion being tested by AI projects trying to automate work across nearly every agency in the executive branch.

The Federal Aviation Administration is exploring whether AI can be a better air traffic controller. The Pentagon is using AI to help officers distinguish between combatants and civilians in the field, and said Monday that its personnel would begin using the chatbot Grok offered by Musk's start-up, xAI, which is trying to gain a foothold in federal agencies. Artificial intelligence technology could soon play a central role in tax audits, airport security screenings and more, according to public documents and interviews with current and former federal workers.

Many of these AI programs aim to shrink the federal workforce - continuing the work of Musk's U.S. DOGE Service that has cut thousands of government employees. Government AI is also promised to reduce wait times and lower costs to American taxpayers.

Government tech watchdogs worry the Trump administration's automation drive - combined with federal layoffs - will give unproven technology an outsize role. If AI drives federal decision-making instead of aiding human experts, glitches could unfairly deprive people of benefits or harm public safety, said Elizabeth Laird, a director at the Washington-based nonprofit Center for Democracy and Technology. There is 'a fundamental mismatch' between what AI can do and what citizens expect from government, she said.

President Joe Biden in 2023 signed an executive order aimed at spurring government use of AI, while also containing its risks. In January, President Donald Trump repealed that order.
His administration has removed AI guardrails while seeking to accelerate its rollout. A comprehensive White House AI plan is expected this month. 'President Trump has long stressed the importance of American AI dominance, and his administration is using every possible tool to streamline our government and deliver more efficient results for the American people,' White House spokeswoman Anna Kelly said in a statement.

The Washington Post reviewed government disclosures and interviewed current and former federal workers about plans to expand government AI. Some expressed alarm at the administration's disregard for safety and government staff. Others saw potential to improve efficiency. 'In government, you have so much that needs doing and AI can help get it done and get it done faster,' said Jennifer Pahlka, who was deputy U.S. chief technology officer in President Barack Obama's second term.

Sahil Lavingia, a former DOGE staffer who pushed the Department of Veterans Affairs to use AI to identify potentially wasteful spending, said government should aggressively deploy the technology becoming so prevalent elsewhere. Government processes are efficient today, he said, 'but could be made more efficient with AI.' Lavingia argued no task should be off limits for experimentation, 'especially in war.' 'I don't trust humans with life and death tasks,' he said, echoing a maximalist view of AI's potential shared by some DOGE staffers.

Here's how AI is being deployed within some government agencies embracing the technology.

- - -

Waging war

The Pentagon is charging ahead with artificial intelligence this year. The number of military and civilian personnel using NGA Maven, one of the Pentagon's core AI programs, has more than doubled since January, said Vice Adm. Frank Whitworth, director of the National Geospatial-Intelligence Agency, in a May speech.
The system, launched in 2017, processes imagery from satellites, drones and other sources to detect and identify potential targets for humans to assess. More than 25,000 U.S. military and civilian personnel around the world now use NGA Maven. NGA Maven is being expanded, Adm. Whitworth said, to interpret data such as audio and text in conjunction with imagery, offering commanders a 'live map' of military operations. The aim is to help it better distinguish combatants from noncombatants and enemies from allies, and for units using NGA Maven to be able to make 1,000 accurate decisions about potential targets within an hour.

The Pentagon's AI drive under Trump will give tech companies like data-mining firm Palantir a larger role in American military power. A White House executive order and a Defense Department memo have instructed federal officials to rely more on commercial technology. In May, the Defense Department announced it was more than doubling its planned spending on a core AI system that is part of NGA Maven called Maven Smart System, allocating an additional $795 million. The software, provided by Palantir, analyzes sensor data to help soldiers identify targets and commanders to approve strikes. It has been used for planning logistics to support deployed troops.

- - -

Air traffic control

The Federal Aviation Administration is testing whether AI software can reliably aid human air traffic controllers, according to a person with knowledge of the agency's plans who spoke on the condition of anonymity to avoid retaliation. Humans would remain in the loop, the person said, but AI would help reduce fatigue and distraction. Air traffic control staff would continue to communicate with pilots, for example, but AI might handle repetitive and data-driven tasks, monitoring airspace more generally. Due in part to ongoing staff shortages in air traffic control, the agency's AI plans include 'planning for less people,' the person said.
Other uses for AI being explored at the FAA include analyzing air traffic or crash data and predicting when aircraft are likely to need maintenance, the person said. The FAA sees artificial intelligence as a potential tool to address airline safety concerns that were brought to the fore by the January midair collision that killed more than 60 people near Reagan National Airport. 'The FAA is exploring how AI can improve safety,' the agency said in an unsigned statement, but air traffic controllers do not currently use the technology. That includes using the technology to scan incident reports and other data to find risks around airports with a mixture of helicopter and airplane traffic, the statement said, while emphasizing humans will remain in charge. 'FAA subject matter experts are essential to our oversight and safety mission and that will never change,' the statement said.

- - -

Examining patents

The U.S. Patent and Trademark Office wants to test whether part of the job of patent examiners - who review patent applications to determine their validity - can be replaced by AI, according to records obtained by The Post and an agency employee who spoke on the condition of anonymity to describe internal deliberations. Patent seekers who opt into a pilot program will have their applications fed into an AI search tool that will trawl the agency's databases for existing patents with similar information. It will email applicants a list of the ten most relevant documents, with the goal of efficiently spurring people to revise, alter or withdraw their application, the records show.

From July 21, per an email obtained by The Post, it will become 'mandatory' for examiners to use an AI-based search tool to run a similarity check on patent applications. The agency did not respond to a question asking if it is the same technology used in the pilot program that will email patent applicants. The agency employee said AI could have an expansive role at USPTO.
Examiners write reports explaining whether applications fall afoul of patent laws or rules. The large language models behind recent AI systems like ChatGPT 'are very good at writing reports, and their ability to analyze keeps getting better,' the employee said.

This month, the agency had planned to roll out another new AI search tool that examiners will be expected to use, according to internal documents reviewed by The Post. But the launch moved so quickly that concerns arose that USPTO workers - and some top leaders - did not understand what was about to happen. Some staff suggested delaying the launch, the documents show, and it is unclear when it will ultimately be released. USPTO referred questions to the Commerce Department, which shared a statement from an unnamed spokesperson. 'At the USPTO, we are evaluating how AI and technology can better support the great work of our patent examiners,' the statement said.

- - -

Airport security screening

You may see fewer security staff next time you fly as the Transportation Security Administration automates a growing number of tasks at airport checkpoints. TSA began rolling out facial recognition cameras to check IDs in 2022, a program now live in more than 200 airports nationwide. Despite studies showing that facial recognition is not perfect and is less accurate at identifying people of color, the agency says it is more effective at spotting impostors than human reviewers. A federal report this year found TSA's facial recognition is more than 99 percent accurate across all demographic groups tested. The agency says it is experimenting with automated kiosks that allow pre-checked passengers to pass through security with 'minimal to no assistance' from TSA officers.
During the Biden administration, these and other AI efforts at TSA were aimed at helping security officers be more efficient - not replacing them, said a former technology official at the Department of Homeland Security, TSA's parent agency, who spoke on the condition of anonymity to discuss internal matters. 'It frees up the officer to spend more time interacting with a passenger,' the former official said.

The new Trump administration has indicated it wants to accelerate AI projects, which could reduce the number of TSA officers at airports, according to Galvin Widjaja, CEO of an Austin-based contractor that works with TSA and DHS on tools for screening airport travelers. 'If an AI can make the decision, and there's an opportunity to reduce the manpower, they're going to do that,' Widjaja said in an interview. Russ Read, a spokesman for TSA, said in an emailed statement that 'the future of aviation security will be a combination of human talent and technological innovation.'

- - -

Tax audits

The Internal Revenue Service has an AI program to help employees query its internal manual, in addition to chatbots for a variety of internal uses. But the agency is now looking to off-load more significant tasks to AI tools. Once the new administration took over, with a mandate from DOGE that targeted the IRS, the agency examined the feasibility of deploying AI to manage tax audits, according to a person familiar with the matter, speaking on the condition of anonymity for fear of retribution.

The push to automate work so central to the IRS's mission underscores a broader strategy: to delegate functions typically left to human experts to powerful software instead. 'The end game is to have one IT, HR, etc., for Treasury and get AI to do everything,' the person said. A DOGE official, start-up founder Sam Corcos, has been overseeing work to deploy AI more broadly at the IRS.
But the lack of oversight of an ambitious effort to centralize the work of the IRS and feed it to a powerful AI tool has raised internal worries, the person said. 'The IRS has used AI for business functions including operational efficiency, fraud detection, and taxpayer services for a long time,' a Treasury Department spokeswoman said in a statement. 'Treasury CIO Sam Corcos is implementing the fulsome IRS modernization plan that taxpayers have deserved for over three decades.'

- - -

Caring for veterans

In April, the Department of Veterans Affairs's top technology official emailed lieutenants with his interpretation of the Trump administration's new AI policy. 'The message is clear to me,' Charles Worthington, who serves as VA's chief technology officer and chief AI officer, said. 'Be aggressive in seizing AI opportunity, while implementing common sense safeguards to ensure these tools are trustworthy when they are used in VA's most sensitive areas such as benefit determinations and health care.' The email was published to VA's website in response to a public records request.

VA said it deployed hundreds of uses of artificial intelligence last year, making it one of the agencies most actively tapping AI based on government disclosures. Among the most controversial of these programs has been REACH VET, a scoring algorithm used to prioritize mental health assistance to patients predicted to be at the highest risk of suicide. Last year, an investigation by the Fuller Project, a nonprofit news organization, found that the system prioritized help to White men, especially those who have been divorced or widowed - groups studies show to be at the highest risk of suicide. VA acknowledged that REACH VET previously did not consider known risk factors for suicide in women veterans, making it less likely that women struggling with thoughts of suicide would be flagged for assistance.
Pete Kasperowicz, a VA spokesman, said in an email that the agency recently updated the REACH VET algorithm to account for several new risk factors specific to women, including military sexual trauma, pregnancy, ovarian cysts and infertility. Since the program launched in 2017, it has helped identify more than 117,000 at-risk veterans, prompting staff to offer them additional support and services, he said.

REACH VET was one of over 300 AI applications that the Biden administration labeled 'safety impacting' or 'rights impacting' in annual transparency reports. The Trump administration, which has derided the 'risk-averse approach of the previous administration,' discontinued those labels and will instead denote sensitive programs as 'high-impact.'

Google Discover adds AI summaries, threatening publishers with further traffic declines

Yahoo


As publishers fret about decreased traffic from Google, the search giant has begun rolling out AI summaries in Discover, the main news feed inside Google's search app on iOS and Android. Now, instead of seeing a headline from a major publication, users will see multiple news publishers' logos in the top-left corner, followed by an AI-generated summary that cites those sources. The app warns that these summaries are generated with AI, 'which can make mistakes.' The feature is not yet appearing for all news stories within the Google app, indicating this change is likely still a test. (Google has been asked for comment about the extent of the rollout, but has not responded.) In tests, TechCrunch was able to view the AI summaries firsthand across both iOS and Android apps in the U.S. In addition to the summaries, Google has been trying out other ways to present the news displayed in Discover. Though not flagged as powered by AI, some stories will include a set of bullet points below the headline or will be grouped with similar news. For instance, a story about President Trump's Ukraine deal also included links to other stories about Trump's latest actions. Meanwhile, a story from The Washington Post about ICE was followed by bullet points that summarized the story's content. The update to the search app comes as a number of publishers have been experimenting with AI on their own sites, including The Wall Street Journal, Yahoo, Bloomberg, USA Today, and others. Startups, too, have gotten in on the action, as with Particle, a news reader that uses AI to not only summarize stories but also allow users to see different sides or ask follow-up questions to better understand the topic covered. Despite these trials, there's significant concern in the publishing industry about how the shift to AI is impacting website traffic and referrals. 
With features like Google's AI Overviews and AI Mode, users no longer have to visit a website directly to get answers to their search queries; the answers can be summarized for them automatically or shared in a chatbot-style interface. Outside of Google, this same trend is seen across other AI apps, like ChatGPT or Perplexity.

Recently, Google tried to appease publishers with the launch of Offerwall, a feature that allows publishers to generate revenue beyond the more traffic-dependent options, like ads. With Offerwall, publishers who use Google Ad Manager can try out different methods to provide access to their content, like micropayments or having users take surveys, sign up for newsletters, watch ads, and more. But for many publishers, these tools are coming too late, as traffic is already in a steep decline.

A story by The Economist this week noted that worldwide search traffic fell by 15% year-over-year as of June, citing data from market intelligence company Similarweb. Earlier data from the firm also found that the share of news searches on the web that result in no click-throughs to news websites had grown from 56% in May 2024, when AI Overviews launched, to nearly 69% as of May 2025. Organic traffic also declined, dropping from over 2.3 billion visits at its peak in mid-2024 to fewer than 1.7 billion, it noted.

Amid this shift, Google Discover still remained a source for clicks, even as traffic from Google Search declined. But that may no longer be the case if the AI summaries roll out more broadly within the Google app.
