Self-Generated Critiques Boost Reward Modeling for Language Models

Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou · November 25, 2024

Summary

Critic-RM is a framework that improves reward models for language models by pairing scalar reward-based preference prediction with high-quality critiques that the model generates for itself. It employs a two-stage process: generating and filtering critiques, followed by joint fine-tuning on reward prediction and critique generation objectives. Critic-RM improves reward modeling accuracy by 3.7%–7.3% over standard reward models and LLM judges, demonstrating strong performance and data efficiency. Further evaluation shows that the generated critiques help rectify flawed reasoning steps, improving reasoning accuracy by 2.5%–3.2%.
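
To make the two-stage recipe concrete, below is a minimal PyTorch-style sketch of the kind of training setup the summary describes, not the paper's actual implementation. It assumes a simple consistency filter that keeps a self-generated critique only when the preference it implies agrees with the human label, and a joint objective that adds a language-modeling loss on the retained critiques to a pairwise reward loss. The helper names (`filter_critiques`, `model.reward`, `model.lm`) and the mixing weight `lam` are illustrative assumptions.

```python
# Illustrative sketch only (not from the paper): a Critic-RM-style training
# step that mixes pairwise reward learning with critique generation.
import torch.nn.functional as F


def filter_critiques(candidates, preferred):
    """Assumed consistency filter: keep self-generated critiques whose
    implied winner matches the human preference label.

    `candidates` is a list of (critique_text, implied_winner) pairs, where
    implied_winner is "A" or "B" as read off the critique itself; how that
    signal is extracted is left abstract here.
    """
    return [text for text, winner in candidates if winner == preferred]


def joint_loss(model, batch, lam=0.5):
    """Pairwise reward loss plus a next-token loss on retained critiques.

    `model.reward` (scalar head) and `model.lm` (language-model head) are
    hypothetical interfaces; `lam` is an assumed mixing weight.
    """
    # Bradley-Terry-style preference loss on scalar rewards.
    r_chosen = model.reward(batch["chosen_ids"])      # shape (B,)
    r_rejected = model.reward(batch["rejected_ids"])  # shape (B,)
    loss_rm = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Standard causal LM loss on the filtered critique tokens.
    logits = model.lm(batch["critique_ids"])          # shape (B, T, V)
    loss_critique = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["critique_labels"][:, 1:].reshape(-1),
        ignore_index=-100,                            # prompt tokens masked
    )
    return loss_rm + lam * loss_critique
```

In this sketch the two objectives share one model, which is the intuition behind joint fine-tuning: critique generation shapes the representation that also drives scalar reward prediction.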

Outline

Introduction
    Background
        Overview of reward models in language models
        Challenges in reward modeling
    Objective
        Aim of Critic-RM framework
        Expected improvements in reward modeling accuracy
Method
    Data Collection
        Source of high-quality critiques
        Methods for generating diverse and relevant critiques
    Data Preprocessing
        Techniques for filtering critiques
        Preparation of data for joint fine-tuning
    Two-Stage Process
        Stage 1: Critique Generation and Filtering
            Overview of the process
            Key components and algorithms involved
        Stage 2: Joint Fine-Tuning
            Objective of fine-tuning
            Integration of reward prediction and critique generation
Evaluation
    Comparison with Standard Reward Models and LLM Judges
        Metrics used for comparison
        Results showing improvements in accuracy
    Critique Generation Effectiveness
        Validation of generated critiques
        Improvement in reasoning accuracy
Results
    Quantitative Improvements
        Percentage increase in reward modeling accuracy
        Comparison with baseline models
    Qualitative Analysis
        Examples of improved reasoning steps
        Insights into the effectiveness of generated critiques
Conclusion
    Summary of Contributions
        Key findings and achievements of Critic-RM
    Future Work
        Potential areas for further research
        Applications and extensions of the framework
Basic info
computation and language
machine learning
artificial intelligence
Insights
What is the main idea behind Critic-RM in enhancing reward models for language models?
How does Critic-RM improve upon standard reward models and LLM judges in terms of reward modeling accuracy?
How does Critic-RM demonstrate its performance and data efficiency, particularly in rectifying flawed reasoning steps and improving reasoning accuracy?
What are the two stages of the Critic-RM process and how do they contribute to its effectiveness?