MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper addresses the problem of efficiently aligning robot behavior with human preferences through interactive imitation learning from human interventions. The approach infers a residual reward function that captures the difference between the human expert's internal reward function and that of the prior policy, and then uses Residual Q-Learning (RQL) to adjust the policy accordingly. While aligning robot behavior with human preferences is not a new problem, the proposed method, MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), improves sample efficiency by learning only the residual reward function rather than inferring the full human reward function from interventions.
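As a rough formalization of this idea (the notation below is illustrative rather than taken verbatim from the paper), the expert's internal reward is modeled as the prior policy's reward plus a residual term, and only the residual, assumed linear in features as is standard in MaxEnt IRL, needs to be inferred:

```latex
% Illustrative decomposition; r_E, r_prior, r_res, w, and \phi are assumed notation.
r_E(s, a) \;=\; r_{\mathrm{prior}}(s, a) \;+\; r_{\mathrm{res}}(s, a),
\qquad
r_{\mathrm{res}}(s, a) \;=\; w^{\top} \phi(s, a)
```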
What scientific hypothesis does this paper seek to validate?
The paper seeks to validate the hypothesis that MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning) can align robot behavior with human preferences in a sample-efficient way by learning from human interventions. The key idea is to infer a residual reward function that captures the difference between the human expert's reward function and that of the prior policy, and to align the policy with human preferences using Residual Q-Learning (RQL). The study aims to demonstrate that MEReQ can leverage the prior policy to reduce the number of expert intervention samples required for alignment.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper proposes MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), a method for sample-efficient alignment from human intervention. MEReQ learns a residual reward through Inverse Reinforcement Learning (IRL) to align the prior policy with the unknown expert reward, which improves sample efficiency.
A key insight behind MEReQ is to infer a residual reward function that captures the discrepancy between the human expert's internal reward function and that of the prior policy, rather than inferring the full human reward function from interventions, which makes learning more efficient.
MEReQ uses Residual Q-Learning (RQL) to fine-tune the prior policy toward the unknown expert reward. This requires learning only the residual reward weights from expert trajectories, without knowing the full reward function, making it more sample-efficient than learning the full reward from scratch with MaxEnt IRL.
The paper also builds on the concept of policy customization, in which the goal is to find a new policy that optimizes both the task objective of the prior policy and additional objectives specified by a downstream task. RQL was proposed as an initial solution for policy customization: it finds a maximum-entropy policy for a new Markov Decision Process (MDP) defined by a residual reward that quantifies the discrepancy between the original and customized task objectives.
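Under the same illustrative notation as above, policy customization can be read as a maximum-entropy RL problem whose reward combines the prior objective with the residual term; RQL obtains this policy without requiring r_prior explicitly. Weighting factors and temperatures from the full RQL formulation are omitted here for brevity:

```latex
% Illustrative max-ent policy-customization objective; \alpha and \mathcal{H} denote the
% entropy temperature and policy entropy, as in standard maximum-entropy RL.
\pi^{\ast} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{\pi}\!\left[\,\sum_{t} r_{\mathrm{prior}}(s_t, a_t) + r_{\mathrm{res}}(s_t, a_t)
+ \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]
```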
Overall, the paper combines residual reward inference, RQL-based policy customization, and learning from intervention to align a prior policy with human preferences efficiently. Compared with previous methods, MEReQ offers the following key characteristics and advantages:
- Residual Reward Inference: MEReQ infers a residual reward function that captures the difference between the human expert's internal reward function and that of the prior policy, and learns this residual through Inverse Reinforcement Learning (IRL) to align the policy with human preferences efficiently.
- Sample-Efficient Alignment: MEReQ uses Residual Q-Learning (RQL) to fine-tune the prior policy toward the unknown expert reward. Only the residual weights need to be learned from expert trajectories, without knowing the full reward function, which improves sample efficiency over learning the full reward with MaxEnt IRL.
- Policy Customization: MEReQ finds a new policy that optimizes both the task objective of the prior policy and additional objectives specified by a downstream task, with RQL providing the mechanism for aligning the policy with these customized objectives.
- Efficient Learning from Interventions: Unlike behavior cloning (BC), which ignores the sequential nature of decision-making, MEReQ operates within the IRL framework and accounts for sequential decision-making and transition dynamics, making it better suited to the fine-tuning setting and helping to avoid catastrophic forgetting.
- Direct Inference of Residual Weights: MEReQ infers the residual weights directly from expert trajectories without knowing the full reward function (sketched below). Applying RQL with the inferred residual weights updates the policy efficiently and reduces the number of expert intervention samples needed for alignment.
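With a linear residual reward as sketched earlier, the residual weights can be updated with the standard MaxEnt IRL gradient, which matches feature expectations between intervention (expert) samples and samples from the current policy. This is the textbook update rather than the paper's exact estimator:

```latex
% Standard MaxEnt IRL gradient applied to the residual weights (illustrative);
% \mathcal{D}_E denotes state-action pairs where the expert intervened.
\nabla_{w}\,\mathcal{L}(w) \;=\;
\mathbb{E}_{(s,a)\sim \mathcal{D}_E}\big[\phi(s, a)\big]
\;-\;
\mathbb{E}_{(s,a)\sim \pi_{w}}\big[\phi(s, a)\big]
```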
Overall, MEReQ stands out for its residual reward inference, sample-efficient alignment, policy customization, and direct inference of residual weights, which together reduce the number of human interventions needed to align a prior policy with human preferences.
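To show how these pieces fit together, the following is a minimal Python sketch of a MEReQ-style loop under the assumptions above (linear residual reward, a generic RQL policy-update step). All function and parameter names (`collect_rollout`, `rql_update`, `feature_fn`, etc.) are hypothetical placeholders, not the authors' implementation:

```python
import numpy as np

def mereq_sketch(env, prior_policy, feature_fn, feature_dim,
                 collect_rollout, rql_update,
                 num_iters=20, rollouts_per_iter=10, lr=0.05):
    """Illustrative MEReQ-style loop: alternate residual-reward inference (MaxEnt IRL)
    with residual-Q policy updates.

    collect_rollout(env, policy) is assumed to yield (state, action, intervened,
    expert_action) tuples; rql_update(prior_policy, residual_reward) is assumed to
    return a policy customized by the given residual reward.
    """
    w = np.zeros(feature_dim)   # residual reward weights: r_res(s, a) = w @ feature_fn(s, a)
    policy = prior_policy       # start from the prior policy

    for _ in range(num_iters):
        expert_feats, policy_feats = [], []

        # 1) Roll out the current policy; the human expert intervenes when the
        #    behavior violates their preferences.
        for _ in range(rollouts_per_iter):
            for state, action, intervened, expert_action in collect_rollout(env, policy):
                if intervened:
                    expert_feats.append(feature_fn(state, expert_action))  # intervention samples
                else:
                    policy_feats.append(feature_fn(state, action))         # autonomous samples

        # 2) MaxEnt IRL step on the residual weights only: ascend the feature-expectation
        #    gap between intervention samples and the current policy's own samples.
        if expert_feats and policy_feats:
            grad = np.mean(expert_feats, axis=0) - np.mean(policy_feats, axis=0)
            w += lr * grad

        # 3) Residual Q-Learning step: customize the prior policy with the inferred
        #    residual reward, without needing the prior policy's reward function.
        policy = rql_update(prior_policy, residual_reward=lambda s, a: w @ feature_fn(s, a))

    return policy, w
```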
Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?
Several related research studies exist in the field of interactive imitation learning and inverse reinforcement learning. Noteworthy researchers in this field include:
- A. Jain, B. Wojcik, T. Joachims, and A. Saxena
- P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei
- E. Bıyık, D. P. Losey, M. Palan, N. C. Landolfi, G. Shevchuk, and D. Sadigh
- K. Lee, L. Smith, and P. Abbeel
- X. Wang, K. Lee, K. Hakhamaneshi, P. Abbeel, and M. Laskin
The key to the solution in "MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention" is MEReQ itself: inferring a residual reward function that captures the discrepancy between the human expert's preferences and the prior policy's reward, and then applying Residual Q-Learning (RQL) with this residual reward to achieve sample-efficient policy alignment from human intervention.
How were the experiments in the paper designed?
The experiments were designed to evaluate the effectiveness of MEReQ for sample-efficient alignment from human intervention. They involved simulated and real-world tasks, categorized by the type of expert involved, and aimed to align robot behavior with human preferences using interventions as feedback. Learning from intervention was framed within the inverse reinforcement learning (IRL) framework, which models the expert as a sequential decision-making agent and infers the expert's reward function from demonstrations. The key insight evaluated was whether inferring a residual reward function, capturing the discrepancy between the human expert's internal reward and that of the prior policy, enables sample-efficient policy alignment.
What is the dataset used for quantitative evaluation? Is the code open source?
The dataset used for quantitative evaluation is not explicitly mentioned in the provided context, and no information is given about whether the code is open source.
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results provide strong support for the hypotheses. The paper introduces MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning) for sample-efficient alignment from human intervention, and experiments on simulated and real-world tasks show that it achieves sample-efficient policy alignment. In particular, MEReQ and its variant MEReQ-NP require fewer total expert samples to reach comparable policy performance than the other baselines under varying criteria strengths across tasks and environments, indicating that MEReQ leverages human interventions effectively while maintaining sample efficiency.
The paper also discusses the limitations of existing methods in exploiting prior policies when learning from human interventions. By inferring a residual reward function that captures the discrepancy between the human expert's preferences and the prior policy's reward, and by using Residual Q-Learning (RQL) to fine-tune the policy toward the unknown expert reward, MEReQ addresses this challenge directly.
Overall, the experiments offer a thorough analysis of MEReQ's sample efficiency in learning from human intervention, and the residual-reward formulation is a meaningful contribution to interactive imitation learning and human-in-the-loop machine learning.
What are the contributions of this paper?
The paper "MEREQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention" makes the following contributions:
- Introduces MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning) for sample-efficient alignment from human intervention, aligning robot behavior with human preferences through interactive imitation learning.
- Proposes inferring a residual reward function that captures the discrepancy between the human expert's preferences and the prior policy's reward, and using Residual Q-Learning (RQL) to align the policy with human preferences.
- Evaluates MEReQ extensively on simulated and real-world tasks, demonstrating that it achieves sample-efficient policy alignment from human intervention.
What work can be continued in depth?
Several aspects of this work can be explored in greater depth:
- Policy Customization and Residual Q-Learning: investigating the application of Residual Q-Learning (RQL) to policy customization, where a new policy is optimized to achieve both the original task objective and additional objectives specified by a downstream task.
- Learning from Intervention: exploring the effectiveness of learning-from-intervention approaches such as MEReQ in aligning prior policies with human preferences efficiently through residual reward learning.
- Human-in-the-Loop Experiments: conducting further studies on how human experts intervene, take control, and engage with the learning process, particularly in tasks such as Highway-Human and Bottle-Pushing-Human.
- Sample-Efficient Alignment: analyzing the sample efficiency of MEReQ, MEReQ-NP, MaxEnt-FT, and MaxEnt in terms of the number of expert samples required for policy alignment and the intervention rate.
- Expert Intervention Learning: exploring frameworks for robot learning from explicit and implicit human feedback, such as expert intervention learning, to enhance the learning process and adapt to human preferences.
- Interactive Reinforcement Learning: investigating the principles and challenges of interactive reinforcement learning, including the design aspects and outcomes of human-in-the-loop machine learning interactions.
- Reward Function Learning: studying methods for learning reward functions from diverse sources of human feedback to improve policy alignment with human preferences.
- Fine-Tuning and Alignment: examining how efficiently prior policies can be fine-tuned from human interventions to reduce the number of expert intervention samples needed for alignment.