This Tool Probes Frontier AI Models for Lapses in Intelligence


WIRED, 02-04-2025

Apr 2, 2025 12:00 PM
A new platform from data training company Scale AI will let artificial intelligence developers find their models' weak spots.
Photo-illustration: Jacqui VanLiew; Getty Images
Executives at artificial intelligence companies may like to tell us that AGI is almost here, but the latest models still need some additional tutoring to help them be as clever as they can.
Scale AI, a company that's played a key role in helping frontier AI firms build advanced models, has developed a platform that can automatically test a model across thousands of benchmarks and tasks, pinpoint weaknesses, and flag additional training data that ought to help enhance its skills. Scale, of course, will supply the data required.
Scale rose to prominence providing human labor for training and testing advanced AI models. Large language models (LLMs) are trained on oodles of text scraped from books, the web, and other sources. Turning these models into helpful, coherent, and well-mannered chatbots requires additional 'post training' in the form of humans who provide feedback on a model's output.
Scale supplies workers who are experts at probing models for problems and limitations. The new tool, called Scale Evaluation, automates some of this work using Scale's own machine learning algorithms.
'Within the big labs, there are all these haphazard ways of tracking some of the model weaknesses,' says Daniel Berrios, head of product for Scale Evaluation. The new tool 'is a way for [model makers] to go through results and slice and dice them to understand where a model is not performing well,' Berrios says, 'then use that to target the data campaigns for improvement.'
Berrios says that several frontier AI model companies are using the tool already. He says that most are using it to improve the reasoning capabilities of their best models. AI reasoning involves a model trying to break a problem into constituent parts in order to solve it more effectively. The approach relies heavily on post-training from users to determine whether the model has solved a problem correctly.
In one instance, Berrios says, Scale Evaluation revealed that a model's reasoning skills fell off when it was fed non-English prompts. 'While [the model's] general purpose reasoning capabilities were pretty good and performed well on benchmarks, they tended to degrade quite a bit when the prompts were not in English,' he says. Scale Evaluation highlighted the issue and allowed the company to gather additional training data to address it.
In recent months, Scale has contributed to the development of several new benchmarks designed to push AI models to become smarter, and to more carefully scrutinize how they might misbehave. These include EnigmaEval, MultiChallenge, MASK, and Humanity's Last Exam.
Scale says it is becoming more challenging to measure improvements in AI models, however, as they get better at acing existing tests. The company says its new tool offers a more comprehensive picture by combining many different benchmarks and can be used to devise custom tests of a model's abilities, like probing its reasoning in different languages. Scale's own AI can take a given problem and generate more examples, allowing for a more comprehensive test of a model's skills.
The company's new tool may also inform efforts to standardize testing AI models for misbehavior. Some researchers say that a lack of standardization means that some model jailbreaks go undisclosed.
In February, the US National Institute of Standards and Technology announced that Scale would help it develop methodologies for testing models to ensure they are safe and trustworthy.
What kinds of errors have you spotted in the outputs of generative AI tools? What do you think are models' biggest blind spots? Let us know by emailing hello@wired.com or by commenting below.


