Beyond the Binary: Capturing Diverse Preferences With Reward Regularization
Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He · December 05, 2024
Summary
The paper addresses a gap between how Large Language Models (LLMs) are developed and how they are deployed, focusing on preference tuning. It introduces a taxonomy of subjectivity dimensions and proposes a method to estimate user disagreement, using synthetic judgments and a margin-term regularization to better align reward models with aggregate preferences. The method improves alignment with human judgments and offers a principled approach to estimating aggregate preferences. The regularization improves performance, particularly on out-of-distribution datasets and on subjective examples with multiple valid answers. The study also suggests that traditional reward modeling approaches may not generalize well across domains.
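To make the margin-term regularization concrete, here is a minimal sketch of a pairwise Bradley-Terry reward loss with a per-example margin tied to estimated annotator agreement. This is an illustrative reconstruction, not the paper's exact implementation: the function name `margin_reward_loss` and parameters such as `estimated_agreement` and `margin_scale` are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def margin_reward_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor,
                       estimated_agreement: torch.Tensor,
                       margin_scale: float = 1.0) -> torch.Tensor:
    """Pairwise Bradley-Terry reward loss with a per-example margin.

    chosen_rewards / rejected_rewards: scalar reward-model outputs, shape (batch,).
    estimated_agreement: estimated fraction of judges (e.g. from synthetic
        judgments) preferring the chosen response, in [0.5, 1.0]; values near
        0.5 indicate a subjective, contested pair.
    """
    # Larger margin when annotators agree strongly; near-zero margin when the
    # preference is contested, so the reward model is not pushed to separate
    # two answers that are both valid.
    margin = margin_scale * (estimated_agreement - 0.5)
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()

# Example: one clear-cut pair and one contested pair.
chosen = torch.tensor([2.0, 0.3])
rejected = torch.tensor([0.5, 0.2])
agreement = torch.tensor([0.95, 0.55])
print(margin_reward_loss(chosen, rejected, agreement))
```

The intuition: when most (synthetic) judges agree, the loss demands a wide reward gap; when they split, the margin shrinks, which regularizes the model toward the aggregate preference rather than an arbitrary binary label.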