21-04-2025
Study Shows Experienced Humans Can Spot Text Created By AI
In this, the generation of generative AI, it's been shown repeatedly that people aren't especially good at telling whether written material was spit out by an AI bot or produced through a labor of human love.
The general inability to tell the real McCoy from OpenAI has been a disaster for teachers, who've been overwhelmed by AI-crafted homework and test answers, undermining what it means to earn an education. But they are not alone, as AI text has diluted every form of written communication. In this environment, where humans have proven to be unreliable at spotting AI text, dozens of companies have sprung up selling AI detection. Dozens more have been created to help the unscrupulous avoid being caught by those systems.
But new research from Jenna Russell, Marzena Karpinska, and Mohit Iyyer of the University of Maryland, Microsoft, and the University of Massachusetts, Amherst, respectively, shows that, on the essential job of detecting AI-created text, people may not be useless after all. The research team found that people who frequently use AI to create text can be quite good at spotting it.
In the study, the team asked a group of five humans to review three types of text: human-written, AI-generated, and AI-generated text that had then been altered or edited by systems designed to fool automated AI detectors. The last set is especially important because other tests have shown that editing AI-generated text can confuse or degrade the accuracy of even the better AI detection systems.
Their headline finding is 'that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback.' The research found that experienced humans reliably detected AI text even after it had been altered and edited, and that – with one exception – the human investigators outperformed every automated detection system they tested. The exception was AI detection provider Pangram, which matched the 'near perfect detection accuracy' of the humans.
According to the research paper, Pangram and the human experts were 99.3% accurate at spotting text created by AI and 100% accurate at picking out the human text as human.
That experienced humans may be able to reliably detect output from AI bots is big news, but don't get too excited or think that we won't need computer-based AI detectors anymore.
For one, the paper asked the experienced humans to vote on whether a piece of writing was AI-generated or not, using the majority vote of five experts as the indicator. In other words, reaching that level of accuracy at separating the automated from the authentic took five people, not one.
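For readers who want to see what that aggregation step actually amounts to, here is a minimal sketch in Python. The verdict labels below are invented for illustration; the study's own annotation data and interface are not reproduced here.

```python
from collections import Counter

def majority_vote(labels):
    """Return the verdict chosen by most annotators.

    `labels` is a list of per-annotator verdicts, e.g. "AI" or "Human".
    With five annotators and a two-way decision, a strict majority
    always exists.
    """
    most_common_label, _count = Counter(labels).most_common(1)[0]
    return most_common_label

# Hypothetical example: three of five experts flag an article as AI-written.
verdicts = ["AI", "AI", "Human", "AI", "Human"]
print(majority_vote(verdicts))  # -> "AI"
```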
This is from the paper: 'The majority vote of our five expert annotators substantially outperforms almost every commercial and open-source detector we tested on these 300 articles, with only the commercial Pangram model matching their near-perfect detection accuracy.'
In fact, pit a single experienced human detector against the best automated system and Pangram still comes out ahead. The automated detector 'outperforms each expert individually,' the paper says.
And, somewhat troublingly, the paper also says that individual human AI sleuths, on average, flagged human-written text as AI-generated 3.3% of the time. For a single human reviewer, that false positive rate could be a real problem when spread over hundreds or even thousands of papers.
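To put that in concrete, if hypothetical, terms: at a 3.3% false positive rate, a single reviewer screening 1,000 genuinely human-written papers would, on average, wrongly flag roughly 33 honest writers (0.033 × 1,000 = 33).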
Moreover, while a panel of experienced human reviewers was more accurate than any single reviewer, hiring five people to review every piece of written work is wildly impractical, as the study's authors concede. 'An obvious drawback is that hiring humans is expensive and slow: on average, we paid $2.82 per article including bonuses, and we gave annotators roughly a week to complete a batch of 60 articles,' the paper reports.
There aren't many settings in which that kind of cost and turnaround time, applied to every single paper, is plausible.
Still, in a world where separating auto-bot text from real writing is essential, the paper makes three important contributions.
First, finding that humans, experienced humans, can actually spot AI text is a significant foundation for further discussion.
Second, as the paper points out, human AI detectors have a real advantage over automated systems in that they can articulate why they suspect a section of text is fake. Humans, the report says, 'can provide detailed explanations of their decision-making process, unlike all of the automatic detectors in our study.' In many settings, that can be quite important.
Finally, knowing that humans can do this, even if it takes a panel of them, may give us a viable second opinion or outside double-check in important cases of suspected AI use. In academic settings, scientific research, or intellectual property disputes, for example, having both a good automated AI detector and an independent way to spot likely AI text could be deeply valuable, even if the human route takes longer and costs more.
In education settings, institutions that care about the rigor and value of their grades and degrees could create a two-tier review system for written work: a fast, high-quality, and accurate automated review, followed by a human panel review in cases where authenticity is contested. In such a system, a double-verified finding could prove conclusive enough to act on, protecting not only schools and teachers but also the honest work of human writers and the general public.
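The plumbing of such a two-tier system is not complicated. The sketch below is purely illustrative: `automated_score` and `convene_panel` are hypothetical callables standing in for a real detector and a real human panel, and the threshold is an arbitrary placeholder, not a value from the study.

```python
def review_submission(text, automated_score, convene_panel, threshold=0.5):
    """Two-tier review: fast automated screen, human panel only when contested.

    `automated_score(text)` returns a probability that the text is
    AI-generated; `convene_panel(text)` returns a majority verdict
    ("AI" or "Human") from a panel of experienced reviewers.
    Both are hypothetical stand-ins, not real APIs.
    """
    # Tier 1: every submission gets the fast automated check.
    score = automated_score(text)
    if score < threshold:
        return "accept"  # no automated flag; no human time spent

    # Tier 2: only contested cases are escalated to the human panel.
    panel_verdict = convene_panel(text)
    if panel_verdict == "AI":
        return "double-verified: likely AI-generated"  # basis for action
    return "accept"  # automated flag not confirmed by the panel
```

The key design choice is that the expensive human step only fires when the automated pass raises a flag that the author contests, so the panel's workload scales with disputed cases rather than with total submissions.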