MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

Yuxin Chen, Chen Tang, Chenran Li, Ran Tian, Peter Stone, Masayoshi Tomizuka, Wei Zhan · June 24, 2024

Summary

MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning) is a sample-efficient method for aligning robot behavior with human preferences in interactive imitation learning from human intervention. Rather than relearning the full task reward, it infers a residual reward function that captures the discrepancy between the human expert's underlying reward and that of the prior policy, and then fine-tunes the prior policy with Residual Q-Learning, reaching aligned behavior with fewer human interventions. Experiments on simulated and real-world tasks, including highway driving and bottle pushing, show that MEReQ outperforms baseline methods in sample efficiency and policy alignment, reducing the required expert input while achieving closer feature-distribution alignment with the expert. Comparisons with MaxEnt and MaxEnt-FT baselines further demonstrate its advantage in efficiency and reduction of human effort. The authors acknowledge limitations, including reliance on simulation and potential instability caused by high variance in intervention data, and suggest offline or model-based reinforcement learning as directions for improving performance and stability.
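To make the mechanism concrete, here is a minimal, self-contained sketch (illustrative only, not the authors' implementation) of fine-tuning a maximum-entropy prior policy with a residual reward in a toy tabular MDP. It assumes the prior policy is available through its log-probabilities and that a residual reward has already been estimated; under a MaxEnt view, alpha * log pi_prior(a|s) stands in for the prior policy's advantage (its reward signal up to a state-dependent constant), so soft Q-iteration on "residual reward + alpha * log pi_prior" pushes the policy toward the combined objective. All names (soft_q_iteration_with_residual, P, r_res, and so on) are hypothetical.

```python
import numpy as np

def soft_q_iteration_with_residual(
    P,             # transition tensor, shape (S, A, S): P[s, a, s'] for a toy MDP
    r_res,         # estimated residual reward, shape (S, A)
    log_pi_prior,  # log-probabilities of the fixed prior policy, shape (S, A)
    alpha=1.0,     # entropy temperature
    gamma=0.95,    # discount factor
    n_iters=500,
):
    """Hypothetical sketch: fine-tune a MaxEnt prior policy with a residual reward.

    alpha * log_pi_prior acts as a stand-in for the prior policy's advantage,
    so soft Q-iteration on (r_res + alpha * log_pi_prior) yields a policy
    aligned with "prior reward + residual reward" up to state-dependent terms.
    """
    S, A = r_res.shape
    r_total = r_res + alpha * log_pi_prior
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # soft state value: V(s) = alpha * logsumexp(Q(s, .) / alpha)
        V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
        Q = r_total + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
    # fine-tuned MaxEnt policy: pi(a|s) proportional to exp((Q(s,a) - V(s)) / alpha)
    V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / alpha)
    return pi / pi.sum(axis=1, keepdims=True)
```

In an intervention-based loop, the residual reward would itself be re-estimated from human takeovers between fine-tuning rounds; a sketch of that step follows the Method outline below.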

Introduction
Background
Importance of human-robot interaction in imitation learning
Challenges in leveraging a pre-trained prior policy during interactive learning
Objective
Develop a method that efficiently incorporates human preferences conveyed through interventions
Improve sample efficiency and policy alignment in interactive imitation learning
Method
Data Collection
Human expert demonstrations and intervention trajectories
Prior policy execution data
Data Preprocessing
Residual reward function calculation
Feature extraction from expert and prior-policy trajectories
Residual Q-Learning
Formulation of the Residual Q-Learning algorithm
Updating the policy based on the inferred residual reward function (see the sketch below)
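How the residual reward itself might be estimated can be illustrated with a hedged, MaxEnt-IRL-style sketch, assuming a linear-in-features residual reward r_res(s, a) = w_res · phi(s, a); the function name, learning rate, and feature arrays are hypothetical, and the paper's actual estimator may differ.

```python
import numpy as np

def update_residual_weights(
    w_res,            # current residual reward weights, shape (d,)
    expert_features,  # features from human intervention segments, shape (N_e, d)
    policy_features,  # features from current policy rollouts, shape (N_p, d)
    lr=0.05,          # step size (hypothetical)
):
    """One MaxEnt-IRL-style gradient step on a linear residual reward.

    The MaxEnt log-likelihood gradient w.r.t. the reward weights is the expert
    feature expectation minus the current policy's feature expectation; driving
    it to zero matches the fine-tuned policy's feature distribution to the
    expert's intervention data.
    """
    grad = expert_features.mean(axis=0) - policy_features.mean(axis=0)
    return w_res + lr * grad
```

A full interactive loop would alternate this weight update with the policy fine-tuning step sketched after the summary above, gathering fresh expert features whenever the human intervenes.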
Experiments
Simulated Tasks
Highway driving simulation
Bottle-pushing task
Performance comparison with baselines
Sample efficiency analysis
Real-World Tasks
Implementation and evaluation in physical environments
Human effort reduction
Feature distribution alignment (see the illustrative metric below)
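As a rough illustration of what "feature distribution alignment" can mean in practice, the toy metric below compares the empirical mean feature vectors of expert and policy trajectories; it is a hypothetical stand-in, not necessarily the metric used in the paper.

```python
import numpy as np

def feature_alignment_gap(expert_features, policy_features):
    """Toy alignment score: L2 distance between the expert's and the policy's
    empirical mean feature vectors (lower = better aligned).
    Both inputs are arrays of shape (num_samples, num_features)."""
    return float(np.linalg.norm(
        expert_features.mean(axis=0) - policy_features.mean(axis=0)
    ))
```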
Comparison with Baselines
MEReQ vs. MaxEnt (maximum-entropy inverse reinforcement learning from scratch)
MEReQ vs. MaxEnt-FT (fine-tuning the prior policy with MaxEnt)
Limitations and Future Work
Reliance on simulation and challenges of real-world deployment
Offline or model-based reinforcement learning as a direction for improvement
Addressing instability caused by high variance in human interventions