Offline Imitation Learning with Model-based Reverse Augmentation

Nanjing University, LAMDA Group
KDD 2024

The main challenges of offline imitation learning and our idea:
(a): The expert data is limited.
(b): The preferred actions on expert-unobserved states are uncertain.
(c): Model-based forward rollouts from expert-unobserved states are difficult to exploit.
(d): In this work, we propose utilizing reverse rollouts to generate trajectories that lead the agent from expert-unobserved states to expert-observed states.

Abstract

In offline Imitation Learning (IL), one of the main challenges is the covariate shift between the expert observations and the actual distribution encountered by the agent, because it is difficult to determine what action an agent should take when outside the state distribution of the expert demonstrations. Recently, model-free solutions have introduced supplementary data and identified latent expert-similar samples to augment the reliable samples during learning. Model-based solutions build forward dynamics models with conservatism quantification and then generate additional trajectories in the neighborhood of expert demonstrations. However, without reward supervision, these methods are often over-conservative in the out-of-expert-support regions, because only in states close to expert-observed states can there be a preferred action enabling policy optimization. To encourage more exploration on expert-unobserved states, we propose a novel model-based framework, called offline Imitation Learning with Self-paced Reverse Augmentation (SRA). Specifically, we build a reverse dynamics model from the offline demonstrations, which can efficiently generate trajectories leading to the expert-observed states in a self-paced style. Then, we use a subsequent reinforcement learning method to learn from the augmented trajectories and transition from expert-unobserved states to expert-observed states. This framework not only explores expert-unobserved states but also guides the agent to maximize long-term returns on these states, ultimately enabling generalization beyond the expert data. Empirical results show that our proposal effectively mitigates the covariate shift and achieves state-of-the-art performance on offline imitation learning benchmarks.
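To make the augmentation step concrete, here is a minimal sketch of the reverse rollout loop: starting from expert-observed states, we repeatedly sample the action that would have led into the current state and the predecessor state it was taken from, producing transitions that, read forward in time, lead back into expert support. The names reverse_policy, reverse_dynamics, and expert_states, the fixed horizon, and the uniform start-state sampling are illustrative placeholders, not the paper's exact interface.

import numpy as np

def reverse_augment(expert_states, reverse_policy, reverse_dynamics,
                    horizon=5, n_rollouts=100, seed=0):
    """Roll out backwards from expert-observed states.

    Returns (s, a, s') transitions oriented forward in time, so every
    augmented trajectory ends inside the expert-observed region.
    """
    rng = np.random.default_rng(seed)
    augmented = []
    for _ in range(n_rollouts):
        # Each reverse rollout starts from a random expert-observed state.
        s_next = expert_states[rng.integers(len(expert_states))]
        for _ in range(horizon):
            a = reverse_policy(s_next)            # action assumed to lead into s_next
            s_prev = reverse_dynamics(s_next, a)  # predecessor state of that action
            augmented.append((s_prev, a, s_next))
            s_next = s_prev                       # step further away from expert support
    return augmented

The subsequent reinforcement learning stage then trains on the expert demonstrations together with these augmented transitions, so the learned policy is guided from expert-unobserved states back toward expert-observed ones.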

Promotional Video

Empirical Results

Comparison with Baselines

We evaluate SRA on D4RL benchmark domains across 14 settings and report the results below.


Self-Paced Process

We visualize the learning process in the Maze2D-Medium environment, including the state-wise cumulative return of the learned policy $\pi$ and the corresponding sampled augmented dataset.

Related Links

This work is motivated by the following related works:

Offline Reinforcement Learning with Reverse Model-based Imagination (ROMI) introduces the reverse dynamics model into the offline RL community; its VAE network is employed in our work to learn the reverse policy (a minimal sketch follows this list).

Offline Imitation Learning without Auxiliary High-quality Behavior Data first introduces to offline IL the idea of leading the agent from expert-unobserved states to expert-observed states. Its model-free solution, BCDP, is employed in our work as the RL pipeline.
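As an illustration of the reverse generative model mentioned above, the following is a minimal PyTorch sketch of a ROMI-style conditional VAE that produces an (action, predecessor state) pair conditioned on the next state. The layer sizes, latent dimension, and single shared decoder are assumptions for exposition, not the released architecture.

import torch
import torch.nn as nn

class ReverseCVAE(nn.Module):
    """Conditional VAE generating (action, predecessor state) given the next state."""
    def __init__(self, state_dim, action_dim, latent_dim=16, hidden=256):
        super().__init__()
        out_dim = action_dim + state_dim          # predict the action and the previous state
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + out_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),    # mean and log-variance of the latent
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, next_state, target):
        # Encode (s', [a, s]) into a latent Gaussian and reconstruct [a, s] from (s', z).
        mu, log_var = self.encoder(torch.cat([next_state, target], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        recon = self.decoder(torch.cat([next_state, z], -1))
        return recon, mu, log_var

    @torch.no_grad()
    def sample(self, next_state):
        # Sample a plausible (action, predecessor state) for each given next state.
        z = torch.randn(next_state.shape[0], self.latent_dim, device=next_state.device)
        return self.decoder(torch.cat([next_state, z], -1))

def cvae_loss(recon, target, mu, log_var, kl_weight=0.5):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon_loss = ((recon - target) ** 2).mean()
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()
    return recon_loss + kl_weight * kl

During training, target would be the concatenation of the logged action and its preceding state from the offline dataset; at augmentation time, sample plays the role of the reverse policy and reverse dynamics model in the earlier sketch.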

BibTeX

@inproceedings{shao2024oil,
  author    = {Shao, Jie-Jing and Shi, Hao-Sen and Guo, Lan-Zhe and Li, Yu-Feng},
  title     = {Offline Imitation Learning with Model-based Reverse Augmentation},
  booktitle = {KDD},
  year      = {2024},
}