Researchers create chatbot to teach law class in university, but it kept messing up


Straits Times, 2 days ago

'AI tutors' have been hyped as a way to revolutionise education.
The idea is that generative artificial intelligence (AI) tools, such as ChatGPT, could adapt to any teaching style set by a teacher. The AI could guide students step by step through problems and offer hints without giving away answers. It could then deliver precise, immediate feedback tailored to each student's individual learning gaps.
Despite the enthusiasm, there is limited research testing how well AI performs in teaching environments, especially within structured university courses.
In our new study, we developed our own AI tool for a university law class. We wanted to know: can it genuinely support personalised learning, or are we expecting too much?
Our study
In 2022, we developed SmartTest, a customisable educational chatbot, as part of a broader project to democratise access to AI tools in education.
Unlike generic chatbots, SmartTest is purpose-built for educators, allowing them to embed questions, model answers and prompts. This means the chatbot can ask relevant questions, deliver accurate and consistent feedback and minimise hallucinations (or mistakes). SmartTest is also instructed to use the Socratic method, encouraging students to think, rather than spoon-feeding them answers.
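To make that design concrete, the sketch below shows one way an educator-supplied question, model answer and Socratic instructions could be combined into a chatbot prompt. It is an illustration only, written against a generic OpenAI-style chat API; the question text, model answer and function names are invented for the example and are not SmartTest's actual code or prompts.

```python
# Illustrative sketch only: how an educator-authored question, model answer and
# Socratic-method instructions might be combined into a chatbot prompt.
# The prompts, question text and helper name are invented; this is not SmartTest's code.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

QUESTION = "Is the accused guilty of theft in this scenario? ..."        # hypothetical
MODEL_ANSWER = "Theft requires dishonest appropriation of property ..."  # hypothetical

SYSTEM_PROMPT = f"""You are a law tutor. Use the Socratic method:
- Present the question below and wait for the student's attempt.
- Guide the student with hints and follow-up questions; never reveal the model answer outright.
- Compare the student's reasoning with the model answer and point out any gaps.
- If you are unsure about something, say so rather than guessing.

Question: {QUESTION}
Model answer (for your reference only): {MODEL_ANSWER}
"""

def tutor_reply(conversation: list[dict]) -> str:
    """Return the chatbot's next message, given the conversation so far."""
    response = client.chat.completions.create(
        model="gpt-4",    # the study used a GPT-4-era model
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + conversation,
        temperature=0.2,  # lower temperature for more consistent feedback
    )
    return response.choices[0].message.content
```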
We trialled SmartTest over five test cycles in a criminal law course (that one of us was coordinating) at the University of Wollongong in 2023.
Each cycle introduced varying degrees of complexity. The first three cycles used short hypothetical criminal law scenarios (for example, is the accused guilty of theft in this scenario?). The last two cycles used simple short-answer questions (for example, what's the maximum sentencing discount for a guilty plea?).
An average of 35 students interacted with SmartTest in each cycle across several criminal law tutorials. Participation was voluntary and anonymous, with students interacting with SmartTest on their own devices for up to 10 minutes per session. Students' conversations with SmartTest – their attempts at answering the question, and the immediate feedback they received from the chatbot – were recorded in our database.
After the final test cycle, we surveyed students about their experience.
What we found
SmartTest showed promise in guiding students and helping them identify gaps in their understanding.
However, in the first three cycles (the problem-scenario questions), between 40 per cent and 54 per cent of conversations had at least one example of inaccurate, misleading or incorrect feedback.
When we shifted to the much simpler short-answer format in cycles four and five, the error rate dropped significantly to between 6 per cent and 27 per cent. However, even in these best-performing cycles, some errors persisted. For example, sometimes SmartTest would affirm an incorrect answer before providing the correct one, which risks confusing students.
A significant revelation was the sheer effort required to get the chatbot working effectively in our tests. Far from a time-saving silver bullet, integrating SmartTest involved painstaking prompt engineering and rigorous manual assessments from educators (in this case, us). This paradox – where a tool promoted as labour-saving demands significant labour – calls into question its practical benefits for already time-poor educators.
Inconsistency is a core issue
SmartTest's behaviour was also unpredictable. Under identical conditions, it sometimes offered excellent feedback and at other times provided incorrect, confusing or misleading information.
For an educational tool tasked with supporting student learning, this raises serious concerns about reliability and trustworthiness.
To assess whether newer models improved performance, we replaced the underlying generative AI powering SmartTest (ChatGPT-4) with newer models such as ChatGPT-4.5, which was released in 2025.
We tested these models by replicating instances where SmartTest provided poor feedback to students in our study. The newer models did not consistently outperform older ones. Sometimes, their responses were even less accurate or useful from a teaching perspective. As such, newer, more advanced AI models do not automatically translate to better educational outcomes.
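As an illustration of how such a re-test can be run, the sketch below replays a recorded conversation against several model names so the resulting feedback can be compared manually. It is not our study's code; the model identifiers, prompt and student answer shown are placeholders.

```python
# Illustrative sketch only: replay a previously recorded conversation against
# several models so the feedback each produces can be compared by hand.
# Model names and the recorded messages below are placeholders, not the study's data.
from openai import OpenAI

client = OpenAI()

def replay(conversation: list[dict], models: list[str]) -> dict[str, str]:
    """Return each candidate model's reply to the same recorded conversation."""
    replies = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=conversation,
            temperature=0.2,
        )
        replies[model] = response.choices[0].message.content
    return replies

# A conversation in which the original feedback was judged poor (placeholder content).
recorded = [
    {"role": "system", "content": "You are a law tutor using the Socratic method ..."},
    {"role": "user", "content": "The maximum sentencing discount for a guilty plea is 50 per cent."},
]

for name, reply in replay(recorded, ["gpt-4", "gpt-4.5-preview"]).items():
    print(name, "->", reply[:200])
```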
What does this mean for students and teachers?
The implications for students and university staff are mixed.
Generative AI may support low-stakes, formative learning activities. But in our study, it could not provide the reliability, nuance and subject-matter depth needed for many educational contexts.
On the plus side, our survey results indicated students appreciated the immediate feedback and conversational tone of SmartTest. Some mentioned it reduced anxiety and made them more comfortable expressing uncertainty. However, this benefit came with a catch: Incorrect or misleading answers could just as easily reinforce misunderstandings as clarify them.
Most students (76 per cent) preferred having access to SmartTest rather than no opportunity to practise questions. However, when given the choice between receiving immediate feedback from AI or waiting one or more days for feedback from human tutors, only 27 per cent preferred AI. Nearly half preferred human feedback with a delay, and the rest were indifferent.
This suggests a critical challenge. Students enjoy the convenience of AI tools, but they still place higher trust in human educators.
A need for caution
Our findings suggest generative AI should still be treated as an experimental educational aid.
The potential is real – but so are the limitations. Relying too heavily on AI without rigorous evaluation risks compromising the very educational outcomes we are aiming to enhance.
Armin Alimardani is a senior lecturer in law and emerging technologies at the University of Wollongong in Australia, and Emma A. Jane is an associate professor in the School of Arts and Media at UNSW Sydney. This article was first published in The Conversation.

Related Articles

Ads ruined social media; now they're coming to AI

Business Times, 14 hours ago

Chatbots might hallucinate and sprinkle too much flattery on their users. 'That's a fascinating question!' one recently told me, but at least the subscription model that underpins them is healthy for our well-being. Many Americans pay about US$20 a month to use the premium versions of OpenAI's ChatGPT, Google's Gemini Pro or Anthropic's Claude, and the result is that the products are designed to provide maximum utility.

Don't expect this status quo to last. Subscription revenue has a limit, and Anthropic's new US$200-a-month 'Max' tier suggests even the most popular models are under pressure to find new revenue streams. Unfortunately, the most obvious one is advertising – the web's most successful business model. Artificial intelligence (AI) builders are already exploring ways to plug more ads into their products, and while that's good for their bottom lines, it also means we're about to see a new chapter in the attention economy that fuelled the Internet. If social media's descent into engagement-bait is any guide, the consequences will be profound.

One cost is addiction. Young office workers are becoming dependent on AI tools to help them write e-mails and digest long documents, according to a recent study, and OpenAI says a cohort of 'problematic' ChatGPT users are hooked on the tool. Putting ads into ChatGPT, which now has more than 500 million active users, won't spur the company to help those people reduce their use of the product. Quite the opposite. Advertising was the reason companies like Mark Zuckerberg's Meta Platforms designed algorithms to promote engagement, keeping users scrolling so they saw more ads and drove more revenue. It's the reason behind the so-called 'enshittification' of the web, a place now filled with clickbait and social media posts that spark outrage.

Baking such incentives into AI will almost certainly lead its designers to find ways to trigger more dopamine spikes, perhaps by complimenting users even more, asking personal questions to get them talking for longer or even cultivating emotional attachments. Millions of people in the Western world already view chatbots in apps like Chai, Talkie, Replika and Botify as friends or romantic partners. Imagine how persuasive such software could be when its users are beguiled. Imagine a person telling their AI about feeling depressed, and the system recommending some affordable holiday destinations or medication to address the problem.

Is that how ads would work in chatbots? The answer is subject to much experimentation, and companies are indeed experimenting. Google's ad network, for instance, recently started putting advertisements in third-party chatbots. Chai, a romance and friendship chatbot, on which users spent 72 minutes a day, on average, in September 2024, serves pop-up ads. And AI answer engine Perplexity displays sponsored questions. After an answer to a question about job hunting, for instance, it might include a list of suggested follow-ups including, at the top, 'How can I use Indeed to enhance my job search?' Perplexity's chief executive officer Aravind Srinivas told a podcast in April that the company was looking to go further by building a browser to 'get data even outside the app' to track 'which hotels are you going (to), which restaurants are you going to', to enable what he called 'hyper-personalised' ads.

For some apps, that might mean weaving ads directly into conversations, using the intimate details shared by users to predict and potentially even manipulate them into wanting something, then selling those intentions to the highest bidder. Researchers at Cambridge University referred to this as the forthcoming 'intention economy' in a recent paper, with chatbots steering conversations toward a brand or even a direct sale. As evidence, they pointed to a 2023 blog post from OpenAI calling for 'data that expresses human intention' to help train its models, a similar effort from Meta, and Apple's 2024 developer framework that helps apps work with Siri to 'predict actions someone might take in the future'.

As for OpenAI's Sam Altman, nothing says 'we're building an ad business' like hiring the person who built delivery app Instacart into an advertising powerhouse. Altman recently poached CEO Fidji Simo to help OpenAI 'scale as we enter a next phase of growth'. In Silicon Valley parlance, to 'scale' often means to quickly expand your user base by offering a service for free, with ads.

Tech companies will inevitably claim that advertising is a necessary part of democratising AI. But we've seen how 'free' services cost people their privacy and autonomy – even their mental health. And AI knows more about us than Google or Facebook ever did – details about our health concerns, relationship issues and work. In two years, they have also built a reputation as trustworthy companions and arbiters of truth. On X, for instance, users frequently bring AI models Grok and Perplexity into conversations to flag if a post is fake. When people trust AI that much, they're more vulnerable to targeted manipulation.

AI advertising should be regulated before it becomes too entrenched, or we'll repeat the mistakes made with social media – scrutinising the fallout of a lucrative business model only after the damage is done. BLOOMBERG

Job interviews enter a strange new world with AI that talks back

Straits Times, a day ago

NEW YORK - For better or worse, the next generation of job interviews has arrived: Employers are now rolling out artificial intelligence simulating live, two-way screener calls using synthetic voices.

Start-ups like Apriora, HeyMilo AI and Ribbon all say they're seeing swift adoption of their software for conducting real-time AI interviews over video. Job candidates converse with an AI 'recruiter' that asks follow-up questions, probes key skills and delivers structured feedback to hiring managers. The idea is to make interviewing more efficient for companies – and more accessible for applicants – without requiring recruiters to be online around the clock.

'A year ago this idea seemed insane,' said Arsham Ghahramani, co-founder and chief executive officer of Ribbon, a Toronto-based AI recruiting start-up that recently raised US$8.2 million (S$10.6 million) in a funding round led by Radical Ventures. 'Now it's quite normalised.'

Employers are drawn to the time savings, especially if they're hiring at high volume and running hundreds of interviews a day. And job candidates – especially those in industries like trucking and nursing, where schedules are often irregular – may appreciate the ability to interview at odd hours, even if a majority of Americans polled in 2024 by Consumer Reports said they were uncomfortable with the idea of algorithms grading their video interviews.

At Propel Impact, a Canadian social impact investing nonprofit, a shift to AI screener interviews came about because of the need to scale up the hiring process. The organisation had traditionally relied on written applications and alumni-conducted interviews to assess candidates. But with plans to bring on more than 300 fellows this year, that approach quickly became unsustainable. At the same time, the rise of ChatGPT was diluting the value of written application materials. 'They were all the same,' said Cheralyn Chok, Propel's co-founder and executive director. 'Same syntax, same patterns.'

Technology allowing AI to converse with job candidates on a screen has been in the works for years. But it wasn't until the public release of large language models like ChatGPT in late 2022 that developers began to imagine – and build – something more dynamic. Ribbon was founded in 2023 and began selling its offering the following year. Mr Ghahramani said the company signed nearly 400 customers in just eight months. HeyMilo and Apriora launched around the same time and also report fast growth, though each declined to share customer counts.

Technical stumbles

Even so, the rollout hasn't been glitch-free. A handful of clips circulating on TikTok show interview bots repeating phrases or misinterpreting simple answers. One widely shared example involved an AI interviewer created by Apriora repeatedly saying the phrase 'vertical bar pilates.' Aaron Wang, Apriora's co-founder and CEO, attributed the error to a voice model misreading the term 'Pilates.' He said the issue was fixed promptly and emphasized that such cases are rare. 'We're not going to get it right every single time,' he said. 'The incident rate is well under 0.001 per cent.'

Braden Dennis, who has used chatbot technology to interview candidates for his AI-powered investment research start-up FinChat, noted that AI sometimes struggles when candidates ask specific follow-up questions. 'It is definitely a very one-sided conversation,' he said. 'Especially when the candidate asks questions about the role. Those can be tricky to field from the AI.'

Start-ups providing the technology emphasized their approach to monitoring and support. HeyMilo maintains a 24/7 support team and automated alerts to detect issues like dropped connections or failed follow-ups. 'Technology can fail,' CEO Sabashan Ragavan said, 'but we've built systems to catch those corner cases.' Ribbon has a similar protocol. Any time a candidate clicks a support button, an alert is triggered that notifies the CEO. While the videos of glitches are a bad look for the sector, Mr Ghahramani said he sees the TikToks making fun of the tools as a sign the technology is entering the mainstream.

Preparing job applicants

Candidates applying to FinChat, which uses Ribbon for its screener interviews, are notified up front that they'll be speaking to an AI and that the team is aware it may feel impersonal. 'We let them know when we send them the link to complete it that we know it is a bit dystopian and takes the 'human' out of human resources,' Mr Dennis said. 'That part is not lost on us.' Still, he said, the asynchronous format helps widen the talent pool and ensures strong applicants aren't missed. 'We have had a few folks drop out of the running once I sent them the AI link,' Mr Dennis said. 'At the end of the day, we are an AI company as well, so if that is a strong deterrent then that's OK.'

Propel Impact prepares candidates by communicating openly about its reasons for using AI in interviews, while hosting information sessions led by humans to maintain a sense of connection with candidates. 'As long as companies continue to offer human touch points along the way, these tools are going to be seen far more frequently,' Mr Chok said.

Regulators have taken notice. While AI interview tools in theory promise transparency and fairness, they could soon face more scrutiny over how they score candidates – and whether they reinforce bias at scale. Illinois now requires companies to disclose whether AI is analysing interview videos and to get candidates' consent, and New York City mandates annual bias audits for any automated hiring tools used by local employers.

Beyond screening calls

Though AI interviewing technology is mainly being used for initial screenings, Ribbon's Mr Ghahramani said 15 per cent of the interviews on its platform now happen beyond the screening stage, up from just 1 per cent a few months ago. This suggests customers are using the technology in new ways. Some employers are experimenting with AI interviews in which they can collect compensation expectations or feedback on the interview process – potentially awkward conversations that some candidates, and hiring managers, may prefer to see delegated to a bot. In a few cases, AI interviews are being used for technical evaluations or even to replace second-round interviews with a human. 'You can actually compress stages,' said Mr Wang. 'That first AI conversation can cover everything from 'Are you authorized to work here?' to fairly technical, domain-specific questions.'

Even as AI handles more of the hiring process, most companies selling the technology still view it as a tool for gathering information, not making the final call. 'We don't believe that AI should be making the hiring decision,' Mr Ragavan said. 'It should just collect data to support that decision.' BLOOMBERG

