Sheng-Hua Wan @ LAMDA, NJU-AI


万盛华
Sheng-Hua Wan
Ph.D. candidate, LAMDA Group
School of Artificial Intelligence
Nanjing University, Nanjing 210023, China

Email: wansh [at] lamda.nju.edu.cn
Github: https://github.com/yixiaoshenghua
Google Scholar


Short Biography

I received my B.Sc. degree in Geographic Information Science (GIS) from Nanjing University in June 2021. In the same year, I was admitted to the Ph.D. program at Nanjing University, exempt from the entrance examination, joining the LAMDA Group led by Professor Zhi-Hua Zhou under the supervision of Prof. De-Chuan Zhan.

Research Interests

My research interests include reinforcement learning and its real-world applications, with a particular focus on sim2real problems.

Funding

Learning World Models under Cross-Modality Observations. The Young Scientists Fund of the National Natural Science Foundation of China (PhD Candidate) (624B200197) 2025.01-2026.12

Policy Transfer via Cross-modality Imitation under the Sim2Real Gap. Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX24_0302) 2024.05-2025.05

Characteristics and mechanism of global flood teleconnection. Frontier Science Center for Critical Earth Material Cycling - "GeoX" Project (2024300270) 2024.01-2024.12

Publications - Conference

  • Shenghua Wan, Le Gan, De-Chuan Zhan. Learning to Be Uncertain: Pre-training World Models with Horizon-Calibrated Uncertainty. In: The Fourteenth International Conference on Learning Representations (ICLR-2026), Rio de Janeiro, Brazil, 2026. [Paper] [Code]

  • Prevailing methods train models to predict a single, deterministic future, an objective that is ill-posed for inherently stochastic environments where actions are unknown. We contend that a world model should instead learn a structured, probabilistic representation of the future where predictive uncertainty correctly scales with the temporal horizon. To achieve this, we introduce a pre-training framework, Horizon-cAlibrated Uncertainty World Model (HAUWM), built on a probabilistic ensemble that predicts frames at randomly sampled future horizons.
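The horizon-calibration idea can be illustrated with a toy sketch (hypothetical code, not the HAUWM implementation): an ensemble of slightly perturbed dynamics models whose disagreement, used as predictive uncertainty, grows with the sampled horizon k.

```python
import numpy as np

rng = np.random.default_rng(0)

class HorizonEnsemble:
    """Toy ensemble: each member predicts a future latent state at horizon k.

    Illustrative sketch of horizon-calibrated uncertainty: members are
    linear dynamics models with slightly different parameters, so their
    predictions diverge as the horizon grows -- uncertainty scales with k.
    """
    def __init__(self, dim=4, n_members=5, noise=0.02):
        # Each member perturbs a shared transition matrix.
        base = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
        self.members = [base + noise * rng.standard_normal((dim, dim))
                        for _ in range(n_members)]

    def predict(self, z, k):
        """Roll each member forward k steps from latent state z."""
        preds = []
        for A in self.members:
            out = z.copy()
            for _ in range(k):
                out = A @ out
            preds.append(out)
        return np.stack(preds)          # shape: (n_members, dim)

    def uncertainty(self, z, k):
        """Ensemble disagreement (mean std across members) at horizon k."""
        return self.predict(z, k).std(axis=0).mean()

model = HorizonEnsemble()
z0 = rng.standard_normal(4)
u_short, u_long = model.uncertainty(z0, 1), model.uncertainty(z0, 10)
```

Under these assumptions, disagreement at horizon 10 exceeds disagreement at horizon 1, which is the calibration property the abstract describes.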

  • Shaowei Zhang, Jiahan Cao, Dian Cheng, Xunlan Zhou, Shenghua Wan, Le Gan, De-Chuan Zhan. Leveraging Conditional Dependence for Efficient World Model Denoising: A Generative Modeling Perspective. In: Advances in Neural Information Processing Systems 38 (NeurIPS-2025), San Diego, California, USA, 2025. [Paper] [Code]

  • We introduce CsDreamer, a model-based RL approach built upon the world model of Collider-Structure Recurrent State-Space Model (CsRSSM). CsRSSM incorporates colliders to comprehensively model the denoising inference process and explicitly capture the conditional dependence. Furthermore, it employs a decoupling regularization to balance the influence of this conditional dependence. By accurately inferring a task-relevant state space, CsDreamer improves learning efficiency during rollouts.

  • Shenghua Wan, Xingye Xu, Le Gan, De-Chuan Zhan. Pre-training World Models from Videos with Generated Actions by the Multi-Modal Large Models. In Proceedings of the 20th Chinese Conference on Machine Learning (CCML-2025), Shanxi, China. [Paper] [Code]

  • Pre-training world models enhances sample efficiency in reinforcement learning, but existing methods struggle with capturing causal mechanisms due to the absence of explicit action labels in video data. We propose MAPO (Multimodal-large-model-generated Action-based pre-training from videOs), which introduces a novel framework that utilizes visual-language models to generate detailed semantic action descriptions, establishing action-state associations with causal explanations. Experimental results demonstrate that MAPO significantly improves performance on the DeepMind Control Suite and Meta-World, particularly in long-horizon tasks, underscoring the importance of semantic action generation for causal reasoning in world model training.

  • Wen-shu Fan, Shenghua Wan, Xin-chun Li, Hai-Hang Sun, Kaichen Huang, Le Gan, De-Chuan Zhan. Twice Learning Revitalizes Behavior Cloning. In Proceedings of the 20th Chinese Conference on Machine Learning (CCML-2025), Shanxi, China. [Paper] [Code]

  • In the imitation learning method of Behavior Cloning (BC), agents often take random actions when facing states not covered by expert data, leading to compounding errors that hinder performance. This paper presents Complete Behavior Cloning (CBC), an enhanced version of BC that aligns more comprehensively with expert knowledge while addressing these errors. Our experiments show that CBC reduces compounding errors, improves transferability, enhances robustness to noise, and decreases reliance on expert data, highlighting the effectiveness of twice learning in reinforcement learning.

  • Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan. FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making. In Proceedings of the 42nd International Conference on Machine Learning (ICML-2025), Vancouver, Canada, 2025. [Paper] [Code]

  • We propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended decision-making in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations.

  • Rui Yu, Shenghua Wan, Yucen Wang, Chen-Xiao Gao, Le Gan, Zongzhang Zhang, De-Chuan Zhan. Reward Models in Deep Reinforcement Learning: A Survey. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI-2025, Survey Track), Montreal, Canada, 2025. [Paper]

  • In this survey, we provide a comprehensive review of reward modeling techniques within the RL literature. We begin by outlining the background and preliminaries in reward modeling. Next, we present an overview of recent reward modeling approaches, categorizing them based on the source, the mechanism, and the reward learning paradigm. Building on this understanding, we discuss various applications of these reward modeling techniques and review methods for evaluating reward models. Finally, we conclude by highlighting promising research directions in reward modeling. Altogether, this survey includes both established and emerging methods, filling the vacancy of a systematic review of reward models in current literature.

  • Kaichen Huang*, Shenghua Wan*, Minghao Shao, Shuai Feng, Le Gan, De-Chuan Zhan. Leveraging Separated World Model for Exploration in Visually Distracted Environments. In: Advances in Neural Information Processing Systems 37 (NeurIPS-2024), Vancouver, Canada, 2024. [Paper] [Code]

  • We propose a bi-level optimization framework named Separation-assisted eXplorer (SeeX). In the inner optimization, SeeX trains a separated world model to extract exogenous and endogenous information, minimizing uncertainty to ensure task relevance. In the outer optimization, it learns a policy on imaginary trajectories generated within the endogenous state space to maximize task-relevant uncertainty.

  • Shenghua Wan, Ziyuan Chen, Shuai Feng, Le Gan, De-Chuan Zhan. SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets. In Proceedings of the 41st International Conference on Machine Learning (ICML-2024), Vienna, Austria, 2024. [Paper] [Code] [Website]

  • We propose a new approach - Separated Model-based Offline Policy Optimization (SeMOPO) - decomposing states into endogenous and exogenous parts via conservative sampling and estimating model uncertainty on the endogenous states only. We provide a theoretical guarantee of model uncertainty and performance bound of SeMOPO, and construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL).

  • Yucen Wang*, Shenghua Wan*, Le Gan, Shuai Feng, De-Chuan Zhan. AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors. In Proceedings of the 41st International Conference on Machine Learning (ICML-2024), Vienna, Austria, 2024. [Paper] [Code] [Website]

  • We propose Implicit Action Generator (IAG) to learn the implicit actions of visual distractors, and present a new algorithm named implicit Action-informed Diverse visual Distractors Distinguisher (AD3), that leverages the action inferred by IAG to train separated world models.

  • Sheng-hua Wan, Haihang Sun, Le Gan, De-chuan Zhan. MOSER: Learning Sensory Policy for Task-specific Viewpoint via View-conditional World Model. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-2024), Jeju, South Korea, 2024. [Paper] [Code]

  • We propose the View-conditional Markov Decision Process (VMDP) assumption and develop a new method, the MOdel-based SEnsor controlleR (MOSER), based on VMDP. MOSER jointly learns a view-conditional world model (VWM) to simulate the environment, a sensory policy to control the camera, and a motor policy to complete tasks.

  • Sheng-hua Wan, Yu-cen Wang, Ming-hao Shao, Ru-ying Chen, De-chuan Zhan. SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models. In Proceedings of the 40th International Conference on Machine Learning (ICML-2023), Honolulu, Hawaii, USA, 2023. [Paper] [Code]

  • We propose a new algorithm - named Separated Model-based Adversarial Imitation Learning (SeMAIL) - which decouples the environment dynamics into two parts according to task-relevant dependency, determined by agent actions, and trains them separately.

Publications - Journal

  • Shenghua Wan, Xingye Xu, Le Gan, De-Chuan Zhan. Pre-training World Models from Videos with Generated Actions by the Multi-Modal Large Models. Computer Science. 2026, 53 (1): 39-50. doi:10.11896/jsjkx.250400064 [Paper]

  • Pre-training world models enhances sample efficiency in reinforcement learning, but existing methods struggle with capturing causal mechanisms due to the absence of explicit action labels in video data. We propose MAPO (Multimodal-large-model-generated Action-based pre-training from videOs), which introduces a novel framework that utilizes visual-language models to generate detailed semantic action descriptions, establishing action-state associations with causal explanations. Experimental results demonstrate that MAPO significantly improves performance on the DeepMind Control Suite and Meta-World, particularly in long-horizon tasks, underscoring the importance of semantic action generation for causal reasoning in world model training.

  • Wen-ye Wang, Sheng-hua Wan, Peng-feng Xiao, Xue-liang Zhang. A Novel Multi-Training Method for Time-Series Urban Green Cover Recognition From Multitemporal Remote Sensing Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2022), 15, 9531-9544. [Paper] [Code] (completed during undergraduate studies)

  • We designed a general multitemporal framework to extract urban green cover using multi-training, a novel semi-supervised learning method for land cover classification on multitemporal remote sensing images.

Preprints

  • Shenghua Wan, Xiaohai Hu, Xunlan Zhou, Lei Yuan, Le Gan, De-Chuan Zhan. Multi-view Consistent Latent Action Learning for World Modeling and Control. [Paper] [Code]

  • We introduce MuCoLA (Multi-view Consistent Latent Action learning), a framework that learns robust, view-invariant action representations by enforcing semantic consistency across synchronized video streams. Departing from restrictive Gaussian priors, MuCoLA utilizes a Student-Teacher network with DINO-style self-distillation to align action representations across views, effectively filtering high-frequency visual noise while preserving motion semantics. Theoretical analysis reveals that our multi-view objective functions as a spectral filter, isolating agent dynamics from environmental nuisances.
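The student-teacher alignment described above can be sketched in miniature (illustrative names and shapes, not the MuCoLA code): the teacher tracks an exponential moving average of the student, and a consistency loss penalizes disagreement among per-view action embeddings.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """DINO-style teacher update: the teacher's weights track an
    exponential moving average of the student's, giving slow-moving,
    stable targets for cross-view alignment."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

def cross_view_consistency(z_views):
    """Consistency loss over per-view action embeddings: mean squared
    distance of each view from the across-view mean. Zero iff all
    synchronized views encode the same latent action."""
    z = np.stack(z_views)
    return ((z - z.mean(axis=0)) ** 2).sum(axis=-1).mean()

# Toy check: identical views incur no loss; divergent views do.
v = np.ones(8)
teacher = {"w": np.zeros(8)}
student = {"w": np.ones(8)}
teacher = ema_update(teacher, student)
```

The slow teacher prevents the trivial collapse that a shared, jointly trained encoder would invite when all views are pushed toward their own mean.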

  • Yuan-yih Shang, Shenghua Wan, Xiaohai Hu, Hongguang Shi, Kai Ming Ting, Lei Yuan, Le Gan, De-Chuan Zhan. Data Dependent Kernel-Aware Unsupervised Skill Discovery. [Paper] [Code]

  • Unsupervised Skill Discovery aims to learn a distinguishable, controllable, and broadly covering skill pool via intra-skill consistency and inter-skill diversity. However, existing skill measures, such as temporal distance or mutual information, ignore the local geometry and density structure of the state space. To this end, we propose Isolation Kernel-aware Skill Discovery (IKSD), which introduces an Isolation Kernel in the hidden space to construct data-adaptive similarities and geometric scales, better distinguishing cross-skill dynamics and improving learning stability. In addition, we propose skill evaluation metrics to measure skill cohesion and inter-skill separation, and to estimate state-space coverage without relying on downstream tasks.
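The density-adaptive similarity at the heart of this approach can be sketched as follows (a toy estimate of the Isolation Kernel; function names and parameters are illustrative, not the paper's implementation).

```python
import numpy as np

def isolation_kernel(x, y, data, psi=8, t=200, seed=0):
    """Toy Isolation Kernel estimate K(x, y): over t random partitionings,
    sample psi reference points from `data`, carve space into Voronoi
    cells around them, and count how often x and y share a cell.
    Cells are large in sparse regions and small in dense ones, so the
    similarity adapts to local density rather than raw distance.
    """
    rng = np.random.default_rng(seed)
    same = 0
    for _ in range(t):
        centers = data[rng.choice(len(data), size=psi, replace=False)]
        cx = np.argmin(np.linalg.norm(centers - x, axis=1))
        cy = np.argmin(np.linalg.norm(centers - y, axis=1))
        same += int(cx == cy)
    return same / t

rng = np.random.default_rng(1)
states = rng.standard_normal((200, 2))   # stand-in for hidden-space states
a, b, c = np.zeros(2), np.array([0.1, 0.0]), np.array([3.0, 0.0])
```

Any point has similarity 1 with itself, and nearby points share cells far more often than distant ones, which is the data-adaptive scale the abstract refers to.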

  • Shaowei Zhang, Jiahan Cao, Xunlan Zhou, Shenghua Wan, De-Chuan Zhan. BRICKS-WM: Building Reusability via Interface Composition Kinetics for Structured World Models. [Paper] [Code]

  • We propose Building Reusability via Interface Composition Kinetics for Structured World Models (BRICKS-WM), a framework for the modular assembly of structured world models. We hypothesize that global dynamics can be decomposed into distinct subsystems interacting via shared protocols. As a minimal instantiation of this framework, we factorize the latent state space into an actuated Agent module and an external Background module, bridged by a learned latent interface. Distinct from prior object-centric methods that prioritize visual segmentation, BRICKS-WM enforces a functional separation in transition dynamics, ensuring that background physics remains agnostic to the agent's embodiment.

  • Qiang Wu, Shenghua Wan, Xiaohai Hu, Lei Yuan, Le Gan, De-Chuan Zhan. From Tools to Entities: Intrinsic Desire as the Foundation of General Agency. [Paper] [Code]

  • Rather than maximizing scalar rewards for specific tasks, agents must be driven by multi-dimensional, internal homeostatic needs that necessitate the continuous tracking of environmental variables. We propose that this teleological shift requires a corresponding structural evolution in memory architecture: moving from frozen weights to a stratified, plastic substrate. This framework transforms AI from reactive instruments into desire-driven entities capable of maintaining coherence and purpose without human intervention.

  • Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, Xiangkun Li, Shenghua Wan, Xiaohai Hu, Lei Yuan, Le Gan, De-Chuan Zhan. MARVL: Multi-Stage Guidance for Robotics Manipulation via Vision Language Models. [Paper] [Code]

  • While Vision–Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL—Multi-stage guidance for Robotic manipulation via Vision Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity.

  • Shenghua Wan, Yuanyi Shang, Shaowei Zhang, Xiaohai Hu, Xunlan Zhou, Lei Yuan, Le Gan, De-Chuan Zhan. Unsupervised Reinforcement Learning: A Survey and Open Problems. [Paper] [Code]

  • Unsupervised Reinforcement Learning (URL) allows agents to develop general behaviors and representations without relying on external rewards. This survey provides a clear overview of the field, organizing recent advances through a unified framework along four axes: learning paradigm, optimization objective, data regime, and agent quantity. We discuss important challenges, such as long-term exploration, transferring skills from simulation to real-world settings, and open-ended learning, while outlining a path for future research on developing autonomous agents.

  • Shenghua Wan, Ziyuan Chen, Le Gan, De-Chuan Zhan. FINE: Effective Model-based Imitation in Visually Noisy Environments via Uncertainty Reduction. [Paper] [Code]

  • We propose eFfective model-based Imitation in visually Noisy Environments (FINE), which incorporates expert demonstration-guided exogenous uncertainty reduction, endogenous model uncertainty-penalized reward estimation, and uncertainty-aware policy learning. We theoretically analyze the benefits of FINE’s design and derive a tight performance bound for imitation. Empirical results on visual imitation tasks validate both the superior performance of our method and its effectiveness in uncertainty reduction.

  • Kaichen Huang*, Hai-Hang Sun*, Shenghua Wan, Minghao Shao, Shuai Feng, Le Gan, De-Chuan Zhan. DIDA: Denoised Imitation Learning based on Domain Adaptation. [Paper] [Code]

  • We focus on the problem of Learning from Noisy Demonstrations (LND), where the imitator is required to learn from data with noise that often occurs during the processes of data collection or transmission. We propose Denoised Imitation learning based on Domain Adaptation (DIDA), which designs two discriminators to distinguish the noise level and expertise level of data, facilitating a feature encoder to learn task-related but domain-agnostic representations.

  • Kaichen Huang*, Minghao Shao*, Shenghua Wan, Hai-Hang Sun, Shuai Feng, Le Gan, De-Chuan Zhan. SENSOR: Imitate Third-Person Expert's Behaviors via Active Sensoring. [Paper] [Code]

  • We introduce active sensoring in the visual IL setting and propose a model-based SENSory imitatOR (SENSOR) to automatically change the agent's perspective to match the expert's. SENSOR jointly learns a world model to capture the dynamics of latent states, a sensor policy to control the camera, and a motor policy to control the agent.

  • Shaowei Zhang, Dian Cheng, Shenghua Wan, Xiaolong Yin, Lu Han, Shuai Feng, Le Gan, De-Chuan Zhan. Efficient Online Reinforcement Learning with Cross-Modality Offline Data. [Paper] [Code]

  • We propose CROss-MOdality Shared world model (CROMOS), a co-modality framework that trains an environmental dynamics model in the latent space by simultaneously aligning the source and target modality data to it for the subsequent policy training. We conduct a theoretical proof of the data utilization effectiveness and provide a practical implementation for our framework.

Selected Honors

MStar Intern at Momenta, 2025.06-2025.09

LAMDA Outstanding Contribution Award, 2024.12

National Scholarship for Doctoral Students, 2024.12

LAMDA Excellent Student Award, 2024.05

Ruli Scholarship, 2023.11

Winner of the Ping An Insurance Data Mining Competition, 2021.12

Presidential Special Scholarship for First-Year Ph.D. Students at Nanjing University, 2021.09

Outstanding Graduate of Nanjing University, 2021.06

2nd place in the ZhongAn Cup Insurance Data Mining Competition, 2020.10

Academic Services

Reviewer of NeurIPS (2023, 2024, 2025), ICML (2024, 2025, 2026), ICLR (2026), AAAI (2026), PR, TNNLS, TETCI, CJE

Teaching Assistant, Introduction to Machine Learning (for undergraduate students), Spring 2022

Correspondence

Email: wansh [at] lamda.nju.edu.cn
Office: Yifu Building, Xianlin Campus of Nanjing University
Address: National Key Laboratory for Novel Software Technology, Nanjing University, Xianlin Campus Mailbox 603, 163 Xianlin Avenue, Qixia District, Nanjing 210023, China

(南京市栖霞区仙林大道163号, 南京大学仙林校区603信箱, 软件新技术国家重点实验室, 210023.)
