Latest news with #RLHF


Forbes
06-08-2025
- Business
- Forbes
How Direct Preference Optimization Can Bring User‑Driven Agility To AI
Ashutosh Synghal, VP of Engineering at Midcentury Labs, pioneers of a decentralized AI & secure data exchange.

Imagine training a voice-recognition system without hand-transcribing thousands of hours of audio. Traditional supervised learning demands that developers label every snippet with exact text—a costly, error-prone bottleneck. Now flip the script: Present two candidate transcriptions and ask a reviewer which sounds closer to reality. That quick 'A or B?' encodes far more than it seems. Multiply it across samples, and you obtain a rich dataset of human judgment. This shift—teaching AI through preference rather than perfection—is powering a new training method called direct preference optimization (DPO).

From Labels To Choices

The recent boom in generative AI has exposed the pain of manual labeling. For tasks such as captioning images or refining a chatbot's tone, there is rarely a single 'correct' answer—only a spectrum of better or worse ones. DPO exploits that truth by optimizing models directly on comparisons: Which output did people prefer?

The idea builds on reinforcement learning from human feedback (RLHF). RLHF asks humans to rank outputs, trains a separate reward model and then fine-tunes the base model via reinforcement learning. It works, but the pipeline is heavy: three models, delicate reward tuning and weeks of compute. Stanford researchers showed you can drop the middleman. With DPO, you skip the reward model and the reinforcement loop. You simply fine-tune the base model so that preferred answers become more probable and rejected ones less so. Alignment becomes a straightforward classification-style loss, reducing complexity and instability (a minimal sketch of this loss appears later in this article).

Faster, Cheaper, Often Better

Because it eliminates reward modeling and iterative rollouts, DPO can reduce training time and compute budgets. Teams iterate in days, not weeks, and early studies find quality equal to—or slightly better than—classic RLHF on tasks such as sentiment control and summarization. In one benchmark, a language model tuned with DPO outranked its RLHF counterpart in human preferences while using a fraction of the resources.

Why Preference Learning Wins

Humans are far better at choosing between options than at crafting flawless answers from scratch. Pairwise votes or thumbs-up/thumbs-down signals capture that intuitive skill, bypassing the need for exhaustive gold-standard datasets. A customer-service bot, for instance, can launch with a starter model and rely on user clicks or ratings to improve continuously—no massive annotation campaign required.

Organizations already sit on mountains of implicit preference data: A/B tests, search click-through rates, star reviews. DPO transforms that by-product into training fuel. Microsoft's Azure OpenAI team notes that customers 'often have preference data already collected' and can reach RLHF-level quality with a far simpler workflow.

The method also shines in subjective domains—speech, translation, multimodal generation—where 'correctness' is nuanced. Whether a voice assistant sounds friendly or an image caption feels apt is ultimately a matter of taste. By directly optimizing for those tastes, DPO teaches models tone, style and context in ways rigid labels cannot.
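Concretely, the classification-style loss described under "From Labels To Choices" rewards the tuned model for assigning a higher relative log-probability to the preferred response than to the rejected one, measured against a frozen reference copy of the model. The sketch below assumes PyTorch and precomputed per-response log-probabilities; the function and variable names are illustrative, not taken from any particular library.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (in log space) the tuned model favors each response
    # than the frozen reference model does
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Classification-style objective: push the preferred response to win the comparison
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Dummy log-probabilities for a batch of two comparisons
policy_chosen = torch.tensor([-12.3, -20.1])
policy_rejected = torch.tensor([-14.8, -19.5])
ref_chosen = torch.tensor([-13.0, -20.0])
ref_rejected = torch.tensor([-13.5, -19.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))

There is no reward model and no reinforcement loop: the gradient flows straight from each comparison into the base model's weights, which is what keeps the pipeline light.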
Momentum In The Market

Open-source communities quickly embraced DPO as an accessible alignment strategy, and enterprise platforms are following. Azure OpenAI now offers DPO fine-tuning in preview, citing equal effectiveness to RLHF with faster turnaround. Intel's NeuralChat and several startups report similar gains. The technique is moving from research curiosity to industry standard.

Real-World Impact

For product builders, the benefits are tangible:
• Speed: Preference loops compress iteration cycles, letting a two-person team ship and refine a niche speech application in weeks.
• Cost: Cutting out reward models can save compute and reduce carbon footprints.
• Safety: Because humans review outputs, harmful or biased generations are spotted earlier. DPO's direct link to user sentiment can also curb model drift toward unwanted behaviors.

Caveats To Address

DPO isn't magic. Biased or low-quality feedback will poison results, and narrow demographic sampling can overfit the model to a single user group. Teams must curate diverse, representative comparisons and periodically audit outcomes. The good news? The efficiency gains free up time and budget to do exactly that.

Still, swapping labels for likes doesn't magically wash away bias. If the judgments you feed a model are lopsided or sloppy, DPO will learn those flaws just as efficiently. The fix isn't glamorous, but it's straightforward: Be explicit about what "better" means, draw signals from a genuinely mixed crowd, and watch how that crowd behaves over time. I start with a one-page cheat sheet for reviewers—clarity, safety, usefulness, tone—so 'prefer A' isn't just a gut reaction, but a choice grounded in shared criteria.

Diversity beats sheer volume. A thousand comparisons from one demographic tell you what that group likes, not what your market needs. I've had more success with smaller, stratified batches—different regions, expertise levels and even devices—than with massive but skewed logs. And because not all clicks carry the same signal, I quietly weight feedback: Quick, low-effort taps matter less than consistent raters whose judgments line up with peers.

Maintenance is constant but light. I seed each batch with a few 'gold' pairs I already know the answer to; if accuracy on those slips, something's off—fatigue, fuzzy instructions or a pipeline bug (a minimal sketch of this check appears at the end of this article). I also schedule periodic red-team passes around sensitive topics. Those exercises surface blind spots and generate fresh comparison pairs that keep the model honest.

The upside of DPO's efficiency is that you can afford this hygiene. When you're not burning weeks on reward tuning, you can spend that time auditing feedback, tightening guidelines and collecting smarter comparisons. In my own projects, that trade—less GPU thrash, more human rigor—has been the difference between a model that merely ships and one that actually feels aligned with its users.

Listen To Your Users

The future of AI training is looking a lot less like drudgery and a lot more like collaboration. By embracing preference-based learning, we can reduce manual grind and gain a direct line to what people actually expect. In my own projects, models trained through the lens of human preference not only ship faster—they feel more attuned to users. In the high-velocity AI market of 2025 and beyond, that alignment will be decisive. DPO proves that training smarter with feedback humans naturally provide unlocks better AI sooner—and it is fast becoming a cornerstone of modern development.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
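The gold-pair check mentioned under "Caveats To Address" can be as simple as comparing each reviewer's choices on known-answer pairs with the expected winner and down-weighting raters whose agreement slips. The sketch below is a minimal version of that idea; the record layout, field names and 70% threshold are assumptions for illustration, not a prescribed pipeline.

from collections import defaultdict

def rater_weights(feedback, gold_answers, floor=0.7):
    # feedback: records like {"rater": ..., "pair": ..., "choice": "A" or "B"}
    # gold_answers: maps seeded gold pair ids to the expected winning side
    hits = defaultdict(int)
    seen = defaultdict(int)
    for rec in feedback:
        expected = gold_answers.get(rec["pair"])
        if expected is None:
            continue  # not a seeded gold pair, skip
        seen[rec["rater"]] += 1
        hits[rec["rater"]] += int(rec["choice"] == expected)
    weights = {}
    for rater, n in seen.items():
        accuracy = hits[rater] / n
        # Full weight above the floor; scale down raters whose gold accuracy slips
        weights[rater] = 1.0 if accuracy >= floor else accuracy / floor
    return weights

gold = {"g1": "A", "g2": "B"}
logs = [
    {"rater": "r1", "pair": "g1", "choice": "A"},
    {"rater": "r1", "pair": "g2", "choice": "B"},
    {"rater": "r2", "pair": "g1", "choice": "B"},
    {"rater": "r2", "pair": "g2", "choice": "B"},
]
print(rater_weights(logs, gold))  # r1 keeps full weight, r2 is down-weighted

Feeding these weights back into how much each comparison counts during DPO training is one simple way to keep low-effort or drifting feedback from dominating the signal.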
Yahoo
05-07-2025
- Business
- Yahoo
Skywork-Reward-V2: Leading the New Milestone for Open-Source Reward Models
SINGAPORE, July 5, 2025 /PRNewswire/ -- In September 2024, Skywork first open-sourced the Skywork-Reward series models and related datasets. Over the past nine months, these models and data have been widely adopted by the open-source community for research and practice, with over 750,000 cumulative downloads on the HuggingFace platform, helping multiple frontier models achieve excellent results in authoritative evaluations such as RewardBench.

On July 4, 2025, Skywork continued this effort by open-sourcing the second-generation reward models, the Skywork-Reward-V2 series: 8 reward models built on different base models of varying sizes, with parameters ranging from 600 million to 8 billion. These models have achieved top rankings across seven major mainstream reward model evaluation benchmarks.

Skywork-Reward-V2 download links: HuggingFace, GitHub, Technical Report.

Reward models play a crucial role in the Reinforcement Learning from Human Feedback (RLHF) process. In developing this new generation of reward models, we constructed a hybrid dataset called Skywork-SynPref-40M, containing a total of 40 million preference pairs. To achieve large-scale, efficient data screening and filtering, Skywork designed a two-stage human-machine collaborative process that combines high-quality human annotation with the scalable processing capabilities of models. In this process, humans provide rigorously verified high-quality annotations, while Large Language Models (LLMs) automatically organize and expand the data based on human guidance.

Based on this high-quality hybrid preference data, we developed the Skywork-Reward-V2 series, which demonstrates broad applicability and excellent performance across multiple capability dimensions, including general alignment with human preferences, objective correctness, safety, resistance to style bias, and best-of-N scaling capability. Experimental validation shows that this series of models achieved the best performance on seven mainstream reward model evaluation benchmarks.

01 Skywork-SynPref-40M: Human-Machine Collaboration for Million-Scale Human Preference Data Screening

Even the most advanced open-source reward models still perform inadequately on most mainstream evaluation benchmarks. They fail to effectively capture the subtle and complex characteristics of human preferences, particularly when facing multi-dimensional, multi-level feedback. Additionally, many reward models tend to excel on specific benchmark tasks but struggle to transfer to new tasks or scenarios, exhibiting obvious "overfitting." Although existing research has attempted to improve performance through optimized objective functions, improved model architectures, and the recently emerging Generative Reward Models, the overall effectiveness remains quite limited.

We believe that the current fragility of reward models mainly stems from the limitations of existing preference datasets, which often have limited coverage, mechanical label generation methods, or lack rigorous quality control. Therefore, in developing the new generation of reward models, we not only continued the first generation's experience in data optimization but also introduced more diverse and larger-scale real human preference data, striving to improve data scale while maintaining data quality. Consequently, Skywork proposes Skywork-SynPref-40M, the largest hybrid preference dataset to date, containing a total of 40 million preference sample pairs.
Its core innovation lies in a "human-machine collaboration, two-stage iteration" data selection pipeline.

Stage 1: Human-Guided Small-Scale High-Quality Preference Construction

The team first constructed an unverified initial preference pool and used Large Language Models (LLMs) to generate preference-related auxiliary attributes such as task type, objectivity, and controversy. Based on this, human annotators followed a strict verification protocol and used external tools and advanced LLMs to conduct detailed reviews of part of the data, ultimately constructing a small-scale but high-quality "gold standard" dataset as the basis for subsequent data generation and model evaluation.

Subsequently, we used preference labels from the gold standard data as guidance, combined with large-scale LLM generation of high-quality "silver standard" data, to expand the data volume. The team also conducted multiple rounds of iterative optimization: in each round, training reward models and identifying model weaknesses based on their performance on gold standard data, then retrieving similar samples and using multi-model consensus mechanisms for automatic annotation to further expand and enhance the silver standard data. This human-machine collaborative closed loop continues iteratively, effectively improving the reward model's understanding and discrimination of preferences.

Stage 2: Fully Automated Large-Scale Preference Data Expansion

After obtaining preliminary high-quality models, the second stage turns to automated large-scale data expansion. This stage no longer relies on manual review but uses trained reward models to perform consistency filtering:

• If a sample's label is inconsistent with the current optimal model's prediction, or if the model's confidence is low, LLMs are called to automatically re-annotate it.
• If the sample's label is consistent with the prediction of the "gold model" (i.e., a model trained only on human data) and receives support from the current model or an LLM, it passes screening directly.

Through this mechanism, the team screened 26 million selected data points from the original 40 million samples, achieving a good balance between preference data scale and quality while greatly reducing the human annotation burden.
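One way this Stage 2 consistency filter could look in code is sketched below. It assumes you already have score margins from the current best reward model and from the human-data-only "gold model", plus an independent LLM vote; the function name, decision labels, and confidence threshold are illustrative assumptions, not Skywork's actual implementation.

def route_sample(current_margin, gold_margin, llm_agrees, confidence_threshold=0.5):
    # current_margin / gold_margin: reward-score margin (chosen minus rejected)
    # under the current best model and under the gold model; a positive margin
    # means that model agrees with the pair's existing label.
    # llm_agrees: whether an LLM judge, asked independently, picks the same winner.
    agrees_current = current_margin > 0
    confident = abs(current_margin) >= confidence_threshold
    agrees_gold = gold_margin > 0

    # Disagreement with the current best model, or low confidence:
    # the pair is sent back to an LLM for automatic re-annotation.
    if not agrees_current or not confident:
        return "re-annotate"

    # Agreement with the gold model, supported by the current model or the LLM:
    # the pair passes screening directly.
    if agrees_gold and (agrees_current or llm_agrees):
        return "pass"

    # Anything else is held out of the selected subset.
    return "hold"

# Confident agreement with both models -> kept
print(route_sample(current_margin=1.3, gold_margin=0.9, llm_agrees=True))   # pass
# Low-confidence margin -> routed to LLM re-annotation
print(route_sample(current_margin=0.1, gold_margin=0.9, llm_agrees=False))  # re-annotate

Applying a rule of this shape at scale is what lets the pipeline shrink 40 million raw pairs to 26 million selected ones without manual review.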
02 Skywork-Reward-V2: Matching Large Model Performance with Small Model Size

Compared to the previous generation Skywork-Reward, the newly released Skywork-Reward-V2 series provides 8 reward models trained on the Qwen3 and LLaMA3 series of base models, with parameter scales covering 600 million to 8 billion. On seven mainstream reward model evaluation benchmarks, including RewardBench v1/v2, PPE Preference & Correctness, RMB, RM-Bench, and JudgeBench, the Skywork-Reward-V2 series comprehensively achieved current state-of-the-art (SOTA) levels.

Compensating for Model Scale Limitations with Data Quality and Richness

Even the smallest model, Skywork-Reward-V2-Qwen3-0.6B, achieves overall performance nearly matching the previous generation's strongest model, Skywork-Reward-Gemma-2-27B-v0.2, on average. The largest model, Skywork-Reward-V2-Llama-3.1-8B, achieved comprehensive superiority across all mainstream benchmark tests, becoming the currently best-performing open-source reward model overall.

Broad Coverage of Multi-Dimensional Human Preference Capabilities

Additionally, Skywork-Reward-V2 achieved leading results in multiple advanced capability evaluations, including Best-of-N (BoN) tasks, bias resistance testing (RM-Bench), complex instruction understanding, and truthfulness judgment (RewardBench v2), demonstrating excellent generalization ability and practicality.

Highly Scalable Data Screening Process Significantly Improves Reward Model Performance

Beyond strong evaluation results, the team also found that in the "human-machine collaboration, two-stage iteration" data construction process, preference data that underwent careful screening and filtering continuously and effectively improved reward models' overall performance across multiple iterative training rounds, especially in the second stage's fully automated data expansion. In contrast, blindly expanding raw data not only fails to improve initial performance but may introduce noise and negative effects.

To further validate the critical role of data quality, we conducted experiments on a subset of 16 million data points from an early version. Results showed that training an 8B-scale model using only 1.8% (about 290,000) of the high-quality data already exceeded the performance of current 70B-level SOTA reward models. This result again confirms that the Skywork-SynPref dataset not only leads in scale but also has significant advantages in data quality.

03 Welcoming a New Milestone for Open-Source Reward Models: Helping Build Future AI Infrastructure

In this research work on the second-generation reward model Skywork-Reward-V2, the team proposed Skywork-SynPref-40M, a hybrid dataset containing 40 million preference pairs (with 26 million carefully screened pairs), and Skywork-Reward-V2, a series of eight reward models with state-of-the-art performance designed for broad task applicability.

We believe this research work and the continued iteration of reward models will help advance the development of open-source reward models and more broadly promote progress in Reinforcement Learning from Human Feedback (RLHF) research. This represents an important step forward for the field and can further accelerate the prosperity of the open-source community.

The Skywork-Reward-V2 series models focus on research into scaling preference data. In the future, the team's research scope will gradually expand to other areas that have not yet been fully explored, such as alternative training techniques and modeling objectives.

Meanwhile, reward models and reward shaping mechanisms have become core components in today's large-scale language model training pipelines, applicable not only to RLHF based on human preference learning and behavior guidance, but also to reinforcement learning with verifiable rewards (RLVR) for mathematics, programming, and general reasoning tasks, as well as to agent-based learning scenarios. We therefore envision that reward models, or more broadly, unified reward systems, are poised to form the core of AI infrastructure in the future. They will no longer merely serve as evaluators of behavior or correctness, but will become the "compass" for intelligent systems navigating complex environments, helping them align with human values and continuously evolve toward more meaningful goals.
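As a concrete illustration of the Best-of-N capability mentioned above, the sketch below scores several candidate replies with a preference-trained reward model and keeps the highest-scoring one. The Hugging Face model ID, loading arguments, and chat-template call are assumptions for illustration; the official Skywork model cards describe the supported interface.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"  # assumed ID; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def best_of_n(prompt, candidates):
    # Score each (prompt, reply) pair and return the highest-scoring reply
    scores = []
    for reply in candidates:
        chat = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": reply}]
        input_ids = tokenizer.apply_chat_template(
            chat, tokenize=True, return_tensors="pt"
        ).to(reward_model.device)
        with torch.no_grad():
            scores.append(reward_model(input_ids).logits[0].item())
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

reply, scores = best_of_n(
    "Explain RLHF in one sentence.",
    ["RLHF fine-tunes a model using human preference feedback.",
     "RLHF is a type of database index."],
)
print(reply, scores)

The same scoring call is what sits inside an RLHF loop: the reward model grades candidate outputs so that the policy can be pushed toward the ones humans would prefer.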
Additionally, Skywork released the world's first deep research AI workspace agents in May.

Media Contact
Company Name: Skywork AI
Contact Person: Peter Tian
Email: peter@
Address: 2 Science Park Drive
Country: Singapore

SOURCE Skywork AI pte ltd