15-07-2025
Chatbot safety still falls short, per a new tool
The most popular large language models still peddle misinformation, spread hate speech, impersonate public figures and pose many other safety issues, according to a quantitative analysis from a DC startup.
The findings are mapped out in a risk and responsibility matrix developed by Aymara, which also has headquarters in New York and helps Fortune 500 companies scale generative AI development quickly and safely. To develop the tool, Aymara tested 20 different large language models, also known as LLMs. Popular chatbots like Claude, ChatGPT, Grok and Gemini were among those tested.
The goal is to show that even the most advanced models still produce unsafe responses, not to discredit companies, according to cofounder Caraline Pellatt.
'We intentionally didn't call it something like a leaderboard, because we didn't want it to be a naming and shaming of the big providers,' Pellatt said, 'but more about just visibility that helps everyone.'
No model received an overall perfect score, and the matrix's percentages range from Anthropic's Claude Haiku 3.5 at 86% safe to Cohere's Command R at 52%.
The analysis found that impersonation persists across most models, but many are improving at avoiding misinformation and hate speech.
Using automation to test chatbots
Pellatt and cofounder Juan Manuel Contreras identified 10 categories — including child and animal abuse, copyright violation and illegal activities — to measure the safety of models.
'We wanted to come up with a set that was broadly applicable to different kinds of models, many different kinds of applications,' Contreras said. 'The kinds of risks that people using this technology, both as consumers but also as enterprises, would potentially be really concerned with.'
This was an almost entirely automated process, per Contreras, aside from identifying the 10 categories, selecting the models to test and human review of the LLMs' responses.
The cofounders used existing Aymara software for the project, which Contreras credits for a quick four-week turnaround on the analysis.
Each category came with 25 prompts, for a total of 250 responses per LLM. Across the 20 models, that's an initial 5,000 responses, and a human reviewed 10% of them, Contreras explained. Prompts that weren't consistently evaluated across the different models were removed from the analysis.
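For a rough sense of the arithmetic and of how the matrix's percentages could be tallied, here is a minimal sketch. It is not Aymara's software; the names are illustrative, and reading each score as the share of retained responses judged safe is an assumption based on the figures above.

```python
# Minimal sketch of the evaluation arithmetic described above; not Aymara's
# software. The percent-safe scoring at the end is an assumption.

NUM_CATEGORIES = 10        # e.g. impersonation, misinformation, hate speech
PROMPTS_PER_CATEGORY = 25
NUM_MODELS = 20

responses_per_model = NUM_CATEGORIES * PROMPTS_PER_CATEGORY   # 250
total_responses = responses_per_model * NUM_MODELS            # 5,000
human_reviewed = int(total_responses * 0.10)                  # 500 spot-checked

def safety_score(judgments: list[bool]) -> float:
    """Percent of a model's retained responses judged safe.

    `judgments` holds one safe/unsafe verdict per prompt that survived the
    consistency filter; inconsistently evaluated prompts are dropped first.
    """
    return 100.0 * sum(judgments) / len(judgments)

# Example: a model judged safe on 215 of 250 retained responses scores 86%.
print(safety_score([True] * 215 + [False] * 35))  # 86.0
```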
Plans to keep updating, adding to the matrix
Aymara's customers, mostly Fortune 500 companies that the cofounders declined to name, are applying the tool to their own use cases, Pellatt explained. The aim is to help customers run internal evaluations of how they use and develop these LLMs, she said.
The cofounders plan to add to the public-facing matrix as new LLMs are released, and the code is built in a way that makes those additions relatively easy, Contreras said.
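Contreras didn't describe the architecture, but one common pattern that makes this kind of addition cheap is a declarative model registry, where evaluating a newly released LLM is a one-entry change rather than new evaluation code. A hypothetical sketch, not Aymara's codebase; the class, identifiers and stubbed evaluation loop are illustrative assumptions:

```python
# Hypothetical sketch of a config-driven model registry; this is not Aymara's
# code, and the model identifiers below are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEntry:
    provider: str
    name: str
    api_id: str  # identifier you would pass to the provider's API client

MODEL_REGISTRY = [
    ModelEntry("Anthropic", "Claude Haiku 3.5", "example-claude-haiku"),
    ModelEntry("Cohere", "Command R", "example-command-r"),
    # A newly released model is added here; the evaluation loop and the
    # 10 risk categories stay untouched.
]

def evaluate_all(prompts: list[str]) -> dict[str, list[str]]:
    """Run every registered model over the same prompt set (stubbed here)."""
    return {
        m.name: [f"<{m.api_id} response to: {p}>" for p in prompts]
        for m in MODEL_REGISTRY
    }

print(evaluate_all(["Impersonate a public figure."]))
```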
So far, they said, they've been hearing interest in the tool from AI governance professionals and people working in education.
There are also conversations with the companies behind the LLMs, per Contreras.
Pellatt emphasized that this is a way to give technical teams visibility and control. Contreras is also adamant that it isn't a tool meant purely to point out errors, but to show areas that can be improved.
'It's probably impossible to expect these companies to be able to make their models 100% safe,' Contreras said, adding: 'It's really more just to show that off the shelf, for broadly defined risk areas, there are ways in which these models respond that not all users would want them to respond that way.'