Proximal Policy Distillation
Giacomo Spigler · July 21, 2024
Summary
Proximal Policy Distillation (PPD) is a novel policy distillation method that combines student-driven distillation with Proximal Policy Optimization (PPO) to improve sample efficiency and to leverage the additional rewards collected by the student policy during distillation. PPD was evaluated across a range of reinforcement learning environments with discrete and continuous action spaces, and compared against two common alternatives, student-distill and teacher-distill. The study found that PPD improves sample efficiency and produces better student policies than typical policy distillation approaches, and that it is more robust when distilling policies from imperfect demonstrations. Its effectiveness was demonstrated through experiments on Atari, MuJoCo, and Procgen environments, using student networks of varying sizes relative to the teacher network. The code for the paper is released as part of a new Python library, sb3-distill, built on top of stable-baselines3 to facilitate policy distillation.
In more detail, PPD has the student collect its own trajectories and accelerates learning by distilling from a teacher, while the PPO objective lets the student exploit the rewards it gathers and potentially surpass the teacher's performance. The distillation loss can be used either to perform traditional distillation or to act as a skill prior. The method was evaluated against the two common policy distillation methods, student-distill and teacher-distill, across reinforcement learning environments with discrete and continuous action spaces, as well as on out-of-distribution generalization, considering different student network sizes relative to the teacher networks. PPD was also assessed for robustness with 'imperfect teachers', whose parameters were artificially corrupted to degrade their performance. A new Python library, sb3-distill, implements the three methods within the stable-baselines3 framework, making these policy distillation methods readily available.
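The core idea can be illustrated with a short sketch. This is a minimal illustration, not the sb3-distill implementation: it assumes the PPD objective is the standard PPO clipped surrogate plus a λ-weighted KL divergence between the teacher's and the student's action distributions, evaluated on states collected by the student (value and entropy terms omitted); all names below are placeholders.

# Minimal PPD-style loss sketch for discrete action spaces (assumed form,
# not the sb3-distill API): PPO clipped surrogate + lambda * KL(teacher || student).
import torch
import torch.nn.functional as F

def ppd_loss(student_logits, old_logits, teacher_logits, actions, advantages,
             clip_range=0.2, distill_lambda=1.0):
    log_probs = F.log_softmax(student_logits, dim=-1)
    old_log_probs = F.log_softmax(old_logits, dim=-1).detach()
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1).detach()

    # PPO clipped surrogate on transitions collected by the student.
    action_log_prob = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_action_log_prob = old_log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(action_log_prob - old_action_log_prob)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    ppo_loss = -torch.min(unclipped, clipped).mean()

    # Distillation term: KL divergence from the teacher's to the student's
    # action distribution on the same states.
    distill_loss = F.kl_div(log_probs, teacher_log_probs,
                            log_target=True, reduction="batchmean")

    return ppo_loss + distill_lambda * distill_loss

Setting distill_lambda to zero recovers plain PPO on the student's own rewards, while a large value makes the update behave like pure distillation.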
The study also reports results for students trained with the three methods (PPD, student-distill, and teacher-distill) from imperfect teachers, as well as distillation onto larger student networks. Results are given as a fraction of the original teacher's score, aggregated by geometric mean over four Atari and four Procgen environments. Increasing λ speeds up distillation, since it gives more weight to the distillation loss, although the overall impact of this hyperparameter is relatively small. Test-time evaluation of PPD students trained with different λ values shows that lower λ lets students learn better policies than the corresponding teacher, because a high λ prioritizes the distillation loss over the PPO loss. The relative final test performance of students compared to the teacher is reported in parentheses in the legend of Figure 2. The study also notes that most current policy distillation methods ignore the rewards obtained by the student during distillation, framing it purely as a supervised learning task.
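As a concrete illustration of that aggregation, the sketch below computes the geometric mean of per-environment student/teacher score ratios; the environment count and numbers are placeholders, not results from the paper.

# Aggregate student scores as a fraction of the teacher's score,
# combined across environments with a geometric mean.
import math

def relative_score(student_scores, teacher_scores):
    ratios = [s / t for s, t in zip(student_scores, teacher_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Example with placeholder values for four environments.
print(relative_score([120.0, 80.0, 300.0, 55.0], [100.0, 100.0, 250.0, 50.0]))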
In conclusion, Proximal Policy Distillation (PPD) is a novel policy distillation method that combines student-driven distillation with Proximal Policy Optimization (PPO) to enhance sample efficiency and leverage additional rewards collected by the student policy during distillation. PPD was evaluated across various reinforcement learning environments, demonstrating its effectiveness in improving sample efficiency and producing better student policies compared to traditional policy distillation approaches. The method's robustness in distilling policies from imperfect demonstrations further highlights its potential in real-world applications. The release of the sb3-distill library provides a valuable tool for researchers and practitioners to implement and experiment with PPD and other policy distillation methods within the stable-baselines3 framework.
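As a rough idea of how such a workflow might look, the sketch below trains a teacher with stable-baselines3's standard PPO and indicates where a distillation model would take its place; the sb3_distill import and class name shown in the comments are assumptions, not the library's documented API.

# Hypothetical workflow sketch; only the stable-baselines3 calls are real.
from stable_baselines3 import PPO

# Train (or load) a teacher policy with standard PPO.
teacher = PPO("MlpPolicy", "CartPole-v1", verbose=0)
teacher.learn(total_timesteps=100_000)
teacher.save("teacher_ppo")

# Distill into a (possibly smaller) student network.
# The import and class name below are assumptions about sb3-distill's API:
# from sb3_distill import PPDAlgorithm
# student = PPDAlgorithm("MlpPolicy", "CartPole-v1", teacher_model=teacher,
#                        policy_kwargs=dict(net_arch=[32, 32]))
# student.learn(total_timesteps=100_000)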