Self-Generated Critiques Boost Reward Modeling for Language Models

Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou · November 25, 2024

Summary

Critic-RM is a framework that improves reward models for language models by pairing scalar reward-based preference prediction with high-quality critiques that the model generates for itself. It employs a two-stage process: generating and filtering critiques, followed by joint fine-tuning on reward prediction and critique generation objectives. Critic-RM improves reward modeling accuracy by 3.7%–7.3% over standard reward models and LLM judges, demonstrating strong performance and data efficiency. Further evaluation shows that the generated critiques help rectify flawed reasoning steps, improving reasoning accuracy by 2.5%–3.2%.
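
To make the two-stage recipe concrete, below is a minimal PyTorch-style sketch of the kind of training setup the summary describes, not the paper's actual implementation. It assumes a simple consistency filter that keeps a self-generated critique only when the preference it implies agrees with the human label, and a joint objective that adds a language-modeling loss on the retained critiques to a pairwise reward loss. The helper names (`filter_critiques`, `model.reward`, `model.lm`) and the mixing weight `lam` are illustrative assumptions.

```python
# Illustrative sketch only (not from the paper): a Critic-RM-style training
# step that mixes pairwise reward learning with critique generation.
import torch.nn.functional as F


def filter_critiques(candidates, preferred):
    """Assumed consistency filter: keep self-generated critiques whose
    implied winner matches the human preference label.

    `candidates` is a list of (critique_text, implied_winner) pairs, where
    implied_winner is "A" or "B" as read off the critique itself; how that
    signal is extracted is left abstract here.
    """
    return [text for text, winner in candidates if winner == preferred]


def joint_loss(model, batch, lam=0.5):
    """Pairwise reward loss plus a next-token loss on retained critiques.

    `model.reward` (scalar head) and `model.lm` (language-model head) are
    hypothetical interfaces; `lam` is an assumed mixing weight.
    """
    # Bradley-Terry-style preference loss on scalar rewards.
    r_chosen = model.reward(batch["chosen_ids"])      # shape (B,)
    r_rejected = model.reward(batch["rejected_ids"])  # shape (B,)
    loss_rm = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Standard causal LM loss on the filtered critique tokens.
    logits = model.lm(batch["critique_ids"])          # shape (B, T, V)
    loss_critique = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["critique_labels"][:, 1:].reshape(-1),
        ignore_index=-100,                            # prompt tokens masked
    )
    return loss_rm + lam * loss_critique
```

In this sketch the two objectives share one model, which is the intuition behind joint fine-tuning: critique generation shapes the representation that also drives scalar reward prediction.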

Outline

Introduction
    Background
        Overview of reward models in language models
        Challenges in reward modeling
    Objective
        Aim of Critic-RM framework
        Expected improvements in reward modeling accuracy
Method
    Data Collection
        Source of high-quality critiques
        Methods for generating diverse and relevant critiques
    Data Preprocessing
        Techniques for filtering critiques
        Preparation of data for joint fine-tuning
    Two-Stage Process
        Stage 1: Critique Generation and Filtering
            Overview of the process
            Key components and algorithms involved
        Stage 2: Joint Fine-Tuning
            Objective of fine-tuning
            Integration of reward prediction and critique generation
Evaluation
    Comparison with Standard Reward Models and LLM Judges
        Metrics used for comparison
        Results showing improvements in accuracy
    Critique Generation Effectiveness
        Validation of generated critiques
        Improvement in reasoning accuracy
Results
    Quantitative Improvements
        Percentage increase in reward modeling accuracy
        Comparison with baseline models
    Qualitative Analysis
        Examples of improved reasoning steps
        Insights into the effectiveness of generated critiques
Conclusion
    Summary of Contributions
        Key findings and achievements of Critic-RM
    Future Work
        Potential areas for further research
        Applications and extensions of the framework
Basic info
computation and language
machine learning
artificial intelligence
Insights
What is the main idea behind Critic-RM in enhancing reward models for language models?
How does Critic-RM improve upon standard reward models and LLM judges in terms of reward modeling accuracy?
How does Critic-RM demonstrate its performance and data efficiency, particularly in rectifying flawed reasoning steps and improving reasoning accuracy?
What are the two stages of the Critic-RM process and how do they contribute to its effectiveness?