Latest news with #AIevaluation


Geeky Gadgets
23-06-2025
- Business
- Geeky Gadgets
Learn How to Evaluate Large Language Models for Performance
What if you could transform the way you evaluate large language models (LLMs) in just a few streamlined steps? Whether you're building a customer service chatbot or fine-tuning an AI assistant, assessing your model's performance often feels like navigating a maze of technical jargon and scattered tools. But here's the truth: without proper evaluations, even the most advanced AI can fail to deliver accurate, reliable, and meaningful results. In this quick-start guide, Matthew Berman demystifies the art of LLM evaluations, showing you how to set up a robust process that ensures your AI solutions are not just functional but exceptional. With a focus on Retrieval-Augmented Generation (RAG) evaluations and Amazon Bedrock, this guide promises to make a once-daunting task surprisingly accessible.

By the end of this tutorial, you will know how to configure a secure AWS environment, build a knowledge base, and implement structured evaluation metrics, all while using Amazon Bedrock's tools such as prompt management and safety guardrails. Along the way, you'll learn how to compare models, pinpoint weaknesses, and refine your AI for optimal performance. Whether you're a seasoned developer or just starting out, this guide offers actionable insights to help you evaluate LLMs with confidence and clarity. Ready to discover how a well-designed evaluation process can elevate your AI projects from good to great? Let's explore the possibilities together.

LLM Evaluation with Amazon Bedrock

The Importance of Model Evaluations
Model evaluations are the cornerstone of building dependable AI systems. They ensure your AI delivers accurate, coherent, and contextually relevant results. For instance, if you're deploying a chatbot to answer questions about a 26-page hotel policy document, evaluations are essential to verify that the responses are both correct and meaningful. Evaluations also serve several key purposes:
- Benchmarking: Track your model's performance over time to monitor improvements or regressions.
- Identifying weaknesses: Pinpoint areas where the model requires refinement.
- Model comparison: Evaluate multiple models to determine the best fit for your specific use case.
Without thorough evaluations, it becomes difficult to measure the effectiveness of your AI or to ensure it meets user expectations.

Understanding Amazon Bedrock
Amazon Bedrock is a fully managed service designed to simplify working with LLMs. It provides access to a variety of AI models from providers such as Amazon, Meta, and Anthropic, along with tools to support evaluation and deployment. Key features of Amazon Bedrock include:
- Agents: Automate workflows and repetitive tasks efficiently.
- Safety guardrails: Prevent harmful or biased outputs to keep AI usage ethical and secure.
- Prompt routing: Optimize query handling to improve response accuracy.
- Knowledge base integration: Connect external data sources for enhanced contextual understanding.
- Prompt management: Organize, test, and refine prompts to improve model performance.
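To make the model-selection side of this concrete, here is a minimal boto3 sketch that lists the foundation models Bedrock exposes in a region. It assumes the AWS SDK for Python (boto3) is installed, that your credentials allow Bedrock read access, and that us-east-1 is your working region; adjust these to your own setup.

```python
# Minimal sketch: list the foundation models Amazon Bedrock exposes in a region.
# Assumes boto3 is installed and AWS credentials with Bedrock read access are configured.
import boto3

# The "bedrock" control-plane client handles model listing and evaluation jobs;
# inference itself goes through the separate "bedrock-runtime" client.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.list_foundation_models()
for model in response["modelSummaries"]:
    print(model["providerName"], model["modelId"])
```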
These features make Amazon Bedrock an ideal platform for evaluating and optimizing LLMs, particularly in scenarios that require external data integration and robust evaluation metrics. The full walkthrough is available in Matthew Berman's video, Setup LLM Evaluations Easily in 2025, on YouTube.

Practical Use Case: Chatbot for a Hotel Policy Document
Imagine you are tasked with creating a chatbot capable of answering questions about a detailed hotel policy document. This scenario underscores the importance of integrating external knowledge bases and conducting thorough evaluations. By following the steps outlined below, you can set up and assess the chatbot's effectiveness, ensuring it provides accurate and helpful responses to users.

Step 1: Configure Your AWS Account
Begin by setting up your AWS account. Create IAM users with the permissions needed to access Amazon Bedrock, S3 buckets, and other AWS services, and configure those permissions securely to prevent unauthorized access. If required, adjust Cross-Origin Resource Sharing (CORS) settings to enable resource access from different origins. Proper configuration at this stage lays the foundation for a secure and efficient evaluation process.

Step 2: Set Up S3 Buckets
Amazon S3 buckets serve as the storage backbone for your evaluation process. Create and configure buckets to store the essential resources:
- Knowledge base: The hotel policy document or other reference materials.
- Test prompts: A set of queries designed to evaluate the chatbot's responses.
- Evaluation results: Data generated during the evaluation process for analysis.
Implement proper access controls to secure sensitive data and ensure compliance with privacy standards.

Step 3: Build the Knowledge Base
Upload the hotel policy document to an S3 bucket and convert it into a vector store. A vector store transforms the document into a searchable format so the LLM can query it efficiently. Once the knowledge base is prepared, sync it with Amazon Bedrock so the model can access it during evaluations. This step ensures the chatbot can retrieve the relevant information needed to answer user queries accurately.

Step 4: Set Up RAG Evaluation
Retrieval-Augmented Generation (RAG) evaluation combines the generative capabilities of LLMs with an external knowledge base to produce accurate and contextually relevant responses. In Amazon Bedrock, configure the following components:
- Inference models: Select the LLMs you wish to evaluate.
- Evaluation metrics: Define criteria such as correctness, coherence, and helpfulness to measure performance.
- Test prompts: Use a diverse set of queries to evaluate the chatbot's ability to handle different scenarios.
Store the evaluation results in your designated S3 bucket for further analysis. This structured approach keeps the evaluation process comprehensive and repeatable. A boto3 sketch covering Steps 2 through 4 appears below.
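The following is a hedged boto3 sketch of Steps 2 through 4: it creates an S3 bucket, uploads the policy document, triggers a sync (ingestion) of an existing Bedrock knowledge base, and runs a single retrieval-augmented test query. The bucket name, file name, knowledge base ID, data source ID, and model ARN are placeholders, not real resources, and the exact field names should be checked against the current Bedrock documentation; ingestion also runs asynchronously, so in practice you would wait for it to finish before querying.

```python
# Hedged sketch of Steps 2-4: stage the document in S3, sync the knowledge base,
# and smoke-test retrieval-augmented generation with a single query.
# All identifiers below (bucket, IDs, model ARN) are placeholders, not real resources.
import boto3

REGION = "us-east-1"
BUCKET = "hotel-policy-eval-demo"          # placeholder bucket name
DOCUMENT = "hotel_policy.pdf"              # the 26-page policy document
KNOWLEDGE_BASE_ID = "KB1234567890"         # knowledge base created beforehand in Bedrock
DATA_SOURCE_ID = "DS1234567890"            # the S3 data source attached to that knowledge base
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0"

# Step 2: create the bucket and upload the reference document.
s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(Bucket=BUCKET)
s3.upload_file(DOCUMENT, BUCKET, f"knowledge-base/{DOCUMENT}")

# Step 3: sync the knowledge base so Bedrock re-indexes the uploaded document
# into its vector store. This job is asynchronous; poll it until it completes.
agent = boto3.client("bedrock-agent", region_name=REGION)
job = agent.start_ingestion_job(
    knowledgeBaseId=KNOWLEDGE_BASE_ID,
    dataSourceId=DATA_SOURCE_ID,
)
print("Ingestion job status:", job["ingestionJob"]["status"])

# Step 4 (smoke test): ask one question through retrieve-and-generate to confirm
# the model grounds its answer in the knowledge base before running full evaluations.
runtime = boto3.client("bedrock-agent-runtime", region_name=REGION)
answer = runtime.retrieve_and_generate(
    input={"text": "What is the hotel's cancellation policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)
print(answer["output"]["text"])
```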
Step 5: Analyze Evaluation Results
Once the evaluation is complete, review the results to assess the model's performance. Focus on key metrics such as correctness, coherence, and helpfulness to determine how effectively the chatbot answers questions. Compare the model's outputs with reference responses and ground truth data to identify discrepancies, and use performance distributions and other analytical tools to pinpoint areas that require improvement. This step is crucial for refining the model and confirming that it meets user expectations.

Step 6: Compare Models
If you are testing multiple models, such as Nova Pro and Nova Premier, use the evaluation results to compare their performance. Visualize differences in metrics to identify which model aligns best with your specific requirements. This comparison lets you make an informed decision about which model to deploy, ensuring optimal performance for your use case. A small analysis sketch illustrating Steps 5 and 6 appears below, after the key takeaways.

Key Takeaways
Evaluating LLMs is an essential step in deploying reliable and effective AI solutions. Amazon Bedrock simplifies this process by providing tools to test and compare models, integrate external knowledge bases, and customize evaluation metrics. By following this guide, you can optimize your AI implementations, ensuring they meet user needs and deliver consistent, high-quality results.

Media Credit: Matthew Berman
Filed Under: AI, Guides
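As a closing illustration of Steps 5 and 6, here is a small, hedged Python sketch that aggregates per-metric scores for two models from evaluation records downloaded from the results bucket and prints a side-by-side comparison. The record layout (one JSON object per line with model, metric, and score fields) and the file name results.jsonl are assumptions for illustration, not Bedrock's exact evaluation-output schema.

```python
# Hedged sketch of Steps 5-6: aggregate per-metric scores for two models and
# compare them side by side. The JSONL layout assumed here
# ({"model": ..., "metric": ..., "score": ...}) is illustrative, not Bedrock's
# exact evaluation-output schema.
import json
from collections import defaultdict
from statistics import mean

def load_scores(path):
    """Read one JSON object per line and group scores by (model, metric)."""
    scores = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            scores[(record["model"], record["metric"])].append(record["score"])
    return scores

def compare(scores, models, metrics):
    """Print the mean score per metric for each model, best model first.

    Assumes every model has at least one score for every metric listed.
    """
    for metric in metrics:
        ranked = sorted(models, key=lambda m: mean(scores[(m, metric)]), reverse=True)
        row = ", ".join(f"{m}: {mean(scores[(m, metric)]):.2f}" for m in ranked)
        print(f"{metric:<12} {row}")

if __name__ == "__main__":
    # results.jsonl stands in for evaluation records exported to the results S3 bucket.
    scores = load_scores("results.jsonl")
    compare(scores, models=["nova-pro", "nova-premier"],
            metrics=["correctness", "coherence", "helpfulness"])
```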


Forbes
29-05-2025
- Business
- Forbes
Snorkel AI Raises $100 Million To Build Better Evaluators For AI Models
Snorkel AI CEO Alex Ratner said his company is placing more emphasis on helping subject matter experts build datasets and models for evaluating AI systems.

Alex Ratner, CEO of Snorkel AI, remembers a time when data labeling (the grueling task of adding context to swathes of raw data and grading an AI model's responses) was considered 'janitorial' work among AI researchers. But that quickly changed when ChatGPT stunned the world in 2022 and breathed new life (and billions of dollars) into a string of startups rushing to supply human-labeled data to the likes of OpenAI and Anthropic to train capable models.

Now, the crowded field of data labeling appears to be undergoing another shift. Fewer companies are training large language models from scratch, leaving that task to the tech giants. Instead, they are fine-tuning models and building applications in areas like software development, healthcare and finance, creating demand for specialized data. AI chatbots no longer just write essays and haikus; they're being tasked with high-stakes jobs like helping physicians make diagnoses or screening loan applications, and they're making more mistakes. Assessing a model's performance has become crucial for businesses to trust and ultimately adopt AI, Ratner said. 'Evaluation has become the new entry point,' he told Forbes.

That urgency for measuring AI's abilities across very specific use cases has sparked a new direction for Snorkel AI, which is shifting gears to help enterprises create evaluation systems and datasets to test their AI models and adjust them accordingly. Data scientists and subject matter experts within an enterprise use Snorkel's software to curate and generate thousands of prompt-and-response pairs as examples of what a correct answer to a query looks like. The AI model is then evaluated against that dataset, and trained on it to improve overall quality.

The company has now raised $100 million in a Series D funding round led by New York-based VC firm Addition at a $1.3 billion valuation, a 30% increase from its $1 billion valuation in 2021. The relatively small change in valuation could be a sign that the company hasn't grown as investors expected, but Ratner said it's a result of a 'healthy correction in the broader market.' Snorkel AI declined to disclose revenue.

Customer support experts at a large telecommunications company have used Snorkel AI to evaluate and fine-tune its chatbot to answer billing-related questions and schedule appointments, Ratner told Forbes. Loan officers at one of the top three U.S. banks have used Snorkel to train an AI system that mined databases to answer questions about large institutional customers, improving its accuracy from 25% to 93%, Ratner said. For Rox, a nascent AI startup that didn't have the manpower or time to evaluate its AI system for salespeople, Snorkel helped improve accuracy by 10% to 12%, Rox cofounder Sriram Sridharan told Forbes.

It's a new focus for the once-buzzy company, which spun out of the Stanford Artificial Intelligence Lab in 2019 with a product that helped experts classify thousands of images and texts. But since the launch of ChatGPT in 2022, the startup has been largely overshadowed by bigger rivals as more companies flooded the data labeling space. Scale AI, which also offers data labeling and evaluation services, is reportedly in talks to finalize a share sale at a $25 billion valuation, up from its $13.8 billion valuation a year ago.
Other competitors include Turing, which doubled its valuation to $2.2 billion from 2021, and Invisible Technologies, which booked $134 million in 2024 revenue without raising much from VCs at all. Snorkel has faced macro challenges too: as AI models like those powering ChatGPT got better, they could label data on a massive scale for free, shrinking the size of the market further. Ratner acknowledged that Snorkel saw a brief period of slow growth right after OpenAI launched ChatGPT and said enterprises had paused pilots with some vendors to consider using AI models for labeling directly. But he said Snorkel's business bounced back in 2023 and has grown since.

Ratner said Snorkel's differentiator is its emphasis on bringing in subject matter experts (either its own or those within a company) and using a proprietary method called 'programmatic labeling' to automatically assign labels to massive troves of data through simple keywords or bits of code rather than doing it manually (a toy illustration of the idea appears at the end of this article). The aim is to help time-crunched experts like doctors and lawyers label data faster and more economically. As it leans into evaluation, which also requires data generation, Snorkel has started hiring tens of thousands of skilled contractors, such as STEM professors, lawyers, accountants and fiction writers, to create specialized datasets for multiple AI developers, who then use the datasets to evaluate their models (Ratner declined to say which frontier AI labs Snorkel works with). They can also use this data to add new functionality to their chatbots, like the ability to break down and 'reason' about a difficult query or conduct in-depth research on a topic, Ratner said.

But even when it comes to building specialized evaluations, Snorkel faces fierce competition, new and old. The top AI companies have released a number of public benchmarks and open source datasets to evaluate their models. LMArena, a popular leaderboard for evaluating AI model performance, recently spun out as a new company and raised $100 million in seed funding from top investors at a hefty $600 million valuation, according to Bloomberg. Plus, companies like Scale, Turing and Invisible all offer evaluation services. But Ratner said that unlike its rivals, Snorkel was built around human experts right from the start.

Saam Motamedi, a partner at Greylock who participated in the round, said these new specialized dataset services are a fast-growing part of Snorkel's business as the industry shifts to what's called 'post-training,' the process of tweaking a model's performance for certain applications. AI has already soaked up most of the internet data, making datasets custom-made by domain experts even more valuable. 'I think that market tailwind has proven to be a really good one for Snorkel,' he said.
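As a toy illustration of the programmatic-labeling idea described above (keyword rules standing in for manual annotation), here is a short, hedged Python sketch. The rules, labels, and sample data are invented for this example and do not reflect Snorkel's actual software or API.

```python
# Toy illustration of programmatic labeling: simple keyword rules assign labels
# to raw text instead of a human annotating every example by hand.
# The rules, labels, and sample data are invented and are not Snorkel's API.

def label_billing(text: str) -> str | None:
    """Rule: mentions of invoices or charges suggest a billing question."""
    return "billing" if any(w in text.lower() for w in ("invoice", "charge", "bill")) else None

def label_scheduling(text: str) -> str | None:
    """Rule: mentions of appointments or rescheduling suggest a scheduling request."""
    return "scheduling" if any(w in text.lower() for w in ("appointment", "reschedule")) else None

RULES = [label_billing, label_scheduling]

def apply_rules(text: str) -> str:
    """Return the label of the first rule that fires, or 'unlabeled' if none do."""
    for rule in RULES:
        label = rule(text)
        if label:
            return label
    return "unlabeled"

if __name__ == "__main__":
    examples = [
        "Why was my card charged twice this month?",
        "I need to reschedule my appointment to Friday.",
        "The app crashes when I open settings.",
    ]
    for text in examples:
        print(f"{apply_rules(text):<11} {text}")
```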