Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning

1Nanjing University, 2Tencent Robotics X


Visual Reinforcement Learning (RL) is a promising approach to achieve human-like intelligence. However, it currently faces challenges in learning efficiently within noisy environments. In contrast, humans can quickly identify task-relevant objects in distraction-filled surroundings by applying previously acquired common knowledge. Recently, foundational models in natural language processing and computer vision have achieved remarkable successes, and the common knowledge within these models can significantly benefit downstream task training. Inspired by these achievements, we aim to incorporate common knowledge from foundational models into visual RL. We propose a novel Focus-Then-Decide (FTD) framework, allowing the agent to make decisions based solely on task-relevant objects. To achieve this, we introduce an attention mechanism to select task-relevant objects from the object set returned by a foundational segmentation model, and only use the task-relevant objects for the subsequent training of the decision module. Additionally, we employ two generic self-supervised objectives to facilitate the rapid learning of this attention mechanism. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that our method can quickly and accurately pinpoint objects of interest in noisy environments. Consequently, it achieves a significant performance improvement over current state-of-the-art algorithms.


FTD adopts a two-stage paradigm to address the problem of learning under visual distractions.

In the focus stage, the segmentation model first processes the distraction-filled observation into a batch of segments, containing both task-relevant and task-irrelevant parts. The attention selector then picks out the task-relevant segments and passes them to the decision stage.
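The selection step can be sketched as scoring each segment's feature vector against a learned task query and keeping the highest-weighted segments. This is a minimal illustration, not the paper's implementation: the function name `select_segments`, the dot-product scoring, and the top-k rule are all assumptions for the sketch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def select_segments(seg_feats, query, k=3):
    """Score each segment feature against a task query and keep the
    top-k segments as task-relevant.  seg_feats: (N, d), query: (d,)."""
    scores = softmax(seg_feats @ query)        # attention weights, shape (N,)
    top = np.argsort(scores)[::-1][:k]         # indices of the k largest weights
    return np.sort(top), scores

# Toy example: 5 segment features, 4 dimensions each.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
query = rng.normal(size=4)
idx, w = select_segments(feats, query, k=2)
```

In practice the query would be learned jointly with the rest of the network, and a soft (differentiable) weighting could replace the hard top-k cut during training.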

In the decision stage, a standard RL algorithm is applied to the selected frame. The network is updated with the combined losses from the RL objective and the two self-supervised objectives.
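The update described above amounts to summing the RL loss with weighted self-supervised terms before backpropagation. A minimal sketch follows; the function name `combined_loss` and the weight values are illustrative assumptions, not the paper's hyperparameters.

```python
def combined_loss(rl_loss, ssl_losses, ssl_weights=(0.1, 0.1)):
    """Total training loss = RL loss plus weighted self-supervised terms.

    rl_loss: scalar loss from the RL algorithm (e.g. actor/critic loss).
    ssl_losses: iterable of scalar self-supervised losses.
    ssl_weights: one coefficient per self-supervised objective.
    """
    total = rl_loss
    for w, l in zip(ssl_weights, ssl_losses):
        total += w * l
    return total

# Example: an RL loss of 1.0 and two self-supervised losses of 2.0 and 3.0.
loss = combined_loss(1.0, [2.0, 3.0])
```

With autograd frameworks the same scalar would be built from tensor-valued losses, so a single backward pass trains the attention selector and the decision module together.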



Experiments are conducted on two environments: DeepMind Control Suite and Franka Emika Robotics. To simulate the realistic condition in which an agent is trained in a natural scene with various task-irrelevant distractions, we replace the static background with frames from a large natural RGB video dataset.
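The background replacement can be sketched as compositing the foreground pixels of each observation onto a frame sampled from the video dataset. This is a simplified illustration under the assumption that a foreground mask is available; the function name `replace_background` is hypothetical.

```python
import numpy as np

def replace_background(obs, video_frame, fg_mask):
    """Keep the foreground of an observation and swap in a natural-video
    background.  obs, video_frame: (H, W, 3) uint8; fg_mask: (H, W) bool."""
    out = video_frame.copy()
    out[fg_mask] = obs[fg_mask]   # foreground pixels survive, rest is replaced
    return out

# Toy 2x2 example: white "agent" pixels composited onto a black video frame.
obs = np.full((2, 2, 3), 255, dtype=np.uint8)
video = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[True, False], [False, True]])
out = replace_background(obs, video, mask)
```

Cycling through successive video frames, rather than reusing one image, gives a dynamic background that changes from step to step.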

Six baselines are selected for comparison, covering representation-learning, model-based, and data-augmentation methods. FTD achieves the best performance in six out of nine tasks.



The frames selected by FTD and the frames reconstructed by Denoised-MDP are plotted for comparison. The frames selected by FTD are more precise and thus offer better interpretability.



@article{chen_focus_then_decide,
    author={Chen, Chao and Xu, Jiacheng and Liao, Weijian and Ding, Hao and Zhang, Zongzhang and Yu, Yang and Zhao, Rui},
    title={Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
}