3 Breakthrough Ways Data Is Powering The AI Reasoning Revolution
Olga Megorskaya is Founder & CEO of Toloka AI, a high-quality data partner for all stages of AI development.
The buzz around reasoning models like DeepSeek R1, OpenAI o1 and Grok 3 signals a turning point: AI development now pivots on reasoning.
When we talk about reasoning, we mean that models can do more than repeat patterns—they think through problems step by step, consider multiple perspectives before giving a final answer and double-check their work. As reasoning skills improve, modern LLMs are pushing us closer to a future where AI agents can autonomously handle all sorts of tasks.
AI agents will become useful enough for widespread adoption when they learn to truly reason: adapting to new challenges, generalizing skills from one domain to another, navigating multiple environments and reliably producing correct answers and outputs. Behind these emerging skills, you'll find sophisticated datasets used for training and evaluating the models. The better the data, the stronger the reasoning skills.
How is data shaping the next generation of reasoning models and agents? As a data partner to frontier labs, we've identified three ways that data drives AI reasoning right now: domain diversity and complexity, refined reasoning and robust evaluations.
By building stronger reasoning skills into AI systems, these new approaches to training and testing data will open the door to the widespread adoption of AI agents.
1. Domain Diversity And Complexity

Current models often train well in structured environments like math and coding, where answer verification is straightforward and fits nicely into classical reinforcement learning frameworks. But the next leap requires pushing into more complex data across a wider knowledge spectrum, so that models generalize better as they transfer learning across domains.
Beyond math and coding, here's the kind of data becoming essential for training the next wave of AI:
Multi-step task data: These cover scenarios like web research trajectories with verification checkpoints.
Open-ended domain data: This includes domains such as law or business consulting that have multifaceted answers, which makes them difficult to verify but important for advanced reasoning. Think of complex legal issues with multiple valid approaches or comprehensive market assessments with validation criteria.
Agent task data: These datasets are built on taxonomies of use cases, domains and categories, as well as real-world tasks. For instance, a task for a corporate assistant agent would be to respond to a support request using simulated knowledge bases and company policies.
Contexts and environments: Agents also need contexts and environments that simulate how they interact with specific software, data in a CRM or knowledge base, or other infrastructure. These contexts are created manually for agent training and testing, as in the sketch below.
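As a concrete illustration, here is a minimal Python sketch of what such a hand-built context might look like. The class, tools and task schema are hypothetical, invented for this example rather than drawn from any production pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedCRMEnvironment:
    """A hand-built context for agent training: a frozen snapshot of
    company knowledge that the agent queries through tools."""
    knowledge_base: dict[str, str]           # article title -> content
    policies: list[str]                      # rules the agent must follow
    tickets: list[dict] = field(default_factory=list)

    def search_kb(self, query: str) -> list[str]:
        """Tool exposed to the agent: keyword lookup over articles."""
        q = query.lower()
        return [title for title, text in self.knowledge_base.items()
                if q in title.lower() or q in text.lower()]

    def read_article(self, title: str) -> str:
        return self.knowledge_base.get(title, "Article not found.")

    def file_ticket(self, summary: str) -> str:
        """Side-effecting tool: escalates to a (simulated) human."""
        self.tickets.append({"summary": summary})
        return f"Ticket #{len(self.tickets)} filed."

# A multi-step training task pairs the environment with checkpoints.
env = SimulatedCRMEnvironment(
    knowledge_base={"Refund policy": "Refunds are issued within 14 days..."},
    policies=["Never promise a refund before checking the refund policy."],
)
task = {
    "instruction": "A customer asks for a refund on a 20-day-old order.",
    "checkpoints": [                 # intermediate steps a grader can verify
        lambda: len(env.search_kb("refund")) > 0,  # agent consulted the KB
        lambda: len(env.tickets) == 1,             # agent escalated correctly
    ],
}
```

The design point is that every tool call and checkpoint is observable, so graders can verify the intermediate steps of a trajectory, not just its final answer.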
2. Refined Reasoning

The path a model takes to an answer is becoming as critical as the answer itself. As classical training approaches are revisited, techniques like reward shaping (providing intermediate guidance) are becoming vital. Current methods focus on guiding the process with feedback from human experts for better coherence, efficiency and safety:
Process supervision: This focuses on a model's "thinking" rather than the outcome by guiding it through logical reasoning steps, or guiding an agent through its interactions with the environment. Think of it like checking step-by-step proofs in math: human experts review each step and identify where a model makes a mistake instead of evaluating only the final answer. (A step-labeling sketch follows this list.)
Preference data: Preference-based learning trains models to prioritize better reasoning paths. Experts review alternative paths and choose the best ones for models to learn from. This data can compare entire trajectories or individual steps in a process. (A minimal loss sketch also follows this list.)
Expert demonstrations: These include data crafted from scratch to show high-quality reasoning sequences, much like teaching by example. Another approach is to edit LLM reasoning steps and let the model learn from the corrections.
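To make process supervision more concrete, the sketch below shows one plausible shape for step-labeled data and a simple reward-shaping function over it. The schema, labels and values are illustrative assumptions, not any specific lab's format:

```python
# One process-supervised record: experts grade each reasoning step,
# not just the final answer (labels here are illustrative).
record = {
    "problem": "What is 15% of 240?",
    "steps": [
        {"text": "15% as a decimal is 0.15.", "label": "correct"},
        {"text": "0.15 * 240 = 30.",          "label": "incorrect"},  # arithmetic slip
        {"text": "So the answer is 30.",      "label": "incorrect"},  # propagated error
    ],
}

def process_reward(steps, step_value=1.0):
    """Reward shaping: grant intermediate credit per verified step and
    stop at the first expert-flagged mistake, so the model is rewarded
    for how far its reasoning stays sound rather than for the outcome."""
    reward = 0.0
    for step in steps:
        if step["label"] != "correct":
            break
        reward += step_value
    return reward

print(process_reward(record["steps"]))  # 1.0: only the first step is sound
```

Stopping credit at the first flagged step is one common shaping choice; edited demonstrations can reuse the same records, with the corrected step text standing in for the flawed one.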
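For preference-based learning, here is a minimal sketch of a direct-preference-optimization-style loss in PyTorch, assuming trajectory-level log-probabilities have already been computed. It is a generic illustration, not the method any particular model was trained with:

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_logps, rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style objective over expert-ranked reasoning paths: push the
    policy to assign relatively more probability to the preferred path
    than the reference model does. Each tensor holds summed token
    log-probs for whole trajectories (or single steps, if preferences
    were collected step by step)."""
    policy_margin = chosen_logps - rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of two preference pairs (log-probs are made up).
loss = preference_loss(
    chosen_logps=torch.tensor([-12.3, -8.1]),
    rejected_logps=torch.tensor([-14.0, -9.5]),
    ref_chosen_logps=torch.tensor([-12.0, -8.4]),
    ref_rejected_logps=torch.tensor([-13.1, -9.0]),
)
print(loss.item())
```

The same loss applies unchanged when preferences compare individual steps rather than entire trajectories; only what the log-probabilities are summed over changes.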
3. Robust Evaluations

Current LLM evaluations have two main limitations: they struggle to provide meaningful signals of substantial improvements, and they are slow to adapt. The challenges mirror those in training data, including limited coverage of niche domains and specialized skills.
To drive real progress, benchmarks need to specifically address the quality and safety of reasoning models and agents. Based on our own efforts, here's how to collaborate with clients on evaluations:
Broader coverage: Include a wider range of domains, specialized skill sets and more complex, real-world tasks. Move beyond single-metric evaluations to assess interdisciplinary and long-term challenges like forecasting.
Fine-grained metrics: Use use-case-specific metrics, co-developed with subject-matter experts to add depth and capture nuances that standard benchmarks miss (see the rubric sketch after this list).
Safety and red teaming: As models develop advanced reasoning, safety evaluations must track the full chain of thought. For agents interacting with external tools or APIs, red teaming becomes critical. We recommend developing structured testing environments for red teamers and using the outcomes to generate new datasets focused on identified vulnerabilities.
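As an illustration of fine-grained, use-case-specific metrics, here is a hypothetical rubric-style scorer. The criteria, weights and keyword checks are invented for the example; in practice each check would be an expert judgment or an LLM-judge call rather than string matching:

```python
# A use-case-specific rubric: domain experts define weighted criteria,
# each scored independently, instead of one pass/fail benchmark metric.
LEGAL_MEMO_RUBRIC = [
    # (criterion, weight, check) - all illustrative
    ("cites_governing_law",  0.4, lambda r: "pursuant to" in r.lower()),
    ("flags_open_questions", 0.3, lambda r: "open question" in r.lower()),
    ("gives_recommendation", 0.3, lambda r: "we recommend" in r.lower()),
]

def rubric_score(response: str, rubric) -> dict:
    """Returns per-criterion results plus a weighted total, so reports
    say *where* reasoning fell short, not just that it did."""
    results = {name: check(response) for name, _, check in rubric}
    total = sum(w for name, w, _ in rubric if results[name])
    return {"criteria": results, "score": round(total, 2)}

response = ("Pursuant to the governing statute, liability is likely. "
            "One open question is jurisdiction. We recommend settling.")
print(rubric_score(response, LEGAL_MEMO_RUBRIC))  # score: 1.0
```

The structure, not the checks, is the point: per-criterion results give the fine-grained signal that single-number benchmarks miss.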
Even as model architectures advance, data remains the bedrock. In the era of reasoning models and agents, the emphasis has shifted decisively toward data quality, diversity and complexity.
New approaches to data production are having a tremendous impact on the pace of AI development, pushing reasoning models forward faster. With data providers upping their game to support the reasoning paradigm, we expect the near future to bring a wave of domain-specific, task-optimized reasoning agents: a new era of agentic AI.