
Debates over AI benchmarking have reached Pokémon
Last week, a post on X went viral, claiming that Google's latest Gemini model surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.
But what the post failed to mention is that Gemini had an advantage: a minimap.
As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
Now, Pokémon is a semi-serious AI benchmark at best — few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.
For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model's coding abilities. The model achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
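To put the two SWE-bench Verified figures side by side, here is a minimal sketch of the arithmetic. It uses only the two scores quoted above; the function name `harness_gap` and the variable names are hypothetical, chosen for illustration, and nothing here reflects how Anthropic computes or reports its results.

```python
# Scores quoted in the article for the same model on SWE-bench Verified.
baseline_pct = 62.3   # reported accuracy without the custom scaffold
scaffold_pct = 70.3   # reported accuracy with the "custom scaffold"


def harness_gap(baseline, with_scaffold):
    """Return (gap in percentage points, relative gain in percent).

    Hypothetical helper for illustration: it only compares two
    already-reported scores and does not run any benchmark.
    """
    absolute = with_scaffold - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


absolute_gap, relative_gain = harness_gap(baseline_pct, scaffold_pct)
print(f"Gap: {absolute_gap} percentage points ({relative_gain}% relative gain)")
```

The model is the same in both runs, so the whole difference comes from the setup around it, which is the point the Pokémon minimap makes as well.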
More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn't seem likely that it'll get any easier to compare models as they're released.
This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/14/debates-over-ai-benchmarking-have-reached-pokemon/


