RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Zhe Li1*, Cheng Chi2*‡, Boan Zhu3*, Yangyang Wei4, Shuanghao Bai5, Yuheng Ji6, Yibo Peng2, Tao Huang7, Pengwei Wang2, Zhongyuan Wang2, S.-H. Gary Chan3, Chang Xu1, Shanghang Zhang2,8†
1University of Sydney 2BAAI 3Hong Kong University of Science and Technology 4Harbin Institute of Technology 5Xi'an Jiaotong University 6Chinese Academy of Sciences 7Shanghai Jiao Tong University 8Peking University
* Equal Contribution     ‡ Project Lead     † Corresponding Author

Abstract

Humans learn locomotion through visual observation, interpreting visual content first and imitating actions afterward. State-of-the-art humanoid locomotion systems, however, rely on either curated motion-capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric or third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate RoboMirror's effectiveness: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.

Methodology

RoboMirror makes the humanoid understand before it imitates. Like a mirror, it can infer and replicate the actions performed by the camera wearer in egocentric videos, reasoning from changes in the surrounding environment's viewpoint (as shown in the upper part of the figure), and it can first understand and then imitate the actions shown in third-person videos (as shown in the lower part of the figure), with no pose estimation or retargeting needed during inference.
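To make the "understand" step concrete, below is a minimal, hedged sketch of distilling a clip into a motion-intent description with a vision-language model. The VLMClient wrapper, the describe_motion helper, the prompt text, and the placeholder response are illustrative assumptions, not the paper's interface.

# Illustrative sketch of the "understand before you imitate" step.
# `VLMClient` and `describe_motion` are hypothetical stand-ins for querying a
# vision-language model such as Qwen3-VL; they are not APIs from the paper.
from dataclasses import dataclass
from typing import List


@dataclass
class VLMClient:
    model_name: str = "Qwen3-VL"  # assumed model identifier

    def query(self, frames: List[bytes], prompt: str) -> str:
        # A real system would send sampled frames plus the prompt to the VLM
        # and return its answer; here we return a fixed placeholder string.
        return "The camera wearer walks forward, then turns left at a steady pace."


def describe_motion(client: VLMClient, frames: List[bytes]) -> str:
    """Distill a raw video clip into a textual motion intent before imitation."""
    prompt = ("Describe the whole-body locomotion performed in this video: "
              "gait, direction, speed, and any turns. Ignore appearance.")
    return client.query(frames, prompt)


if __name__ == "__main__":
    intent = describe_motion(VLMClient(), frames=[])  # frame sampling omitted in this sketch
    print(intent)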


Overview of RoboMirror. It adopts a two-stage framework: in the first stage, Qwen3-VL processes egocentric or third-person video inputs, and a DiT-based diffusion model generates motion latents. In the second, policy-learning stage, an MoE-based teacher policy is trained with RL, while a diffusion-based student policy learns to denoise actions under the guidance of the reconstructed motion latents. During inference, RoboMirror first understands and then imitates the motion in the video, without explicit motion extraction or retargeting.
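The sketch below illustrates the latent-conditioned control idea in PyTorch: a placeholder encoder standing in for the VLM + DiT stage maps video features to a motion latent, and a toy diffusion student policy denoises an action chunk conditioned on that latent. All module names, dimensions, and the simplified reverse step are illustrative assumptions, not the released implementation.

# Minimal, self-contained sketch of latent-conditioned action denoising.
# Sizes and the DDPM-like update are assumptions for illustration only.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, HORIZON = 256, 29, 16  # assumed sizes


class MotionIntentEncoder(nn.Module):
    """Stand-in for the VLM + DiT stage: maps pooled video features to a motion latent."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, 512), nn.GELU(),
                                  nn.Linear(512, LATENT_DIM))

    def forward(self, video_feats):                # (B, T, feat_dim) per-frame features
        return self.proj(video_feats.mean(dim=1))  # (B, LATENT_DIM)


class DiffusionStudentPolicy(nn.Module):
    """Predicts the noise in an action chunk, conditioned on the motion latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HORIZON * ACTION_DIM + LATENT_DIM + 1, 512), nn.GELU(),
            nn.Linear(512, HORIZON * ACTION_DIM))

    def forward(self, noisy_actions, latent, t):
        x = torch.cat([noisy_actions.flatten(1), latent, t[:, None].float()], dim=-1)
        return self.net(x).view(-1, HORIZON, ACTION_DIM)


def denoise_actions(policy, latent, steps=10):
    """Heavily simplified reverse process: iteratively subtract the predicted noise."""
    actions = torch.randn(latent.size(0), HORIZON, ACTION_DIM)
    for t in reversed(range(steps)):
        t_batch = torch.full((latent.size(0),), t)
        actions = actions - policy(actions, latent, t_batch) / steps
    return actions


if __name__ == "__main__":
    video_feats = torch.randn(1, 32, 768)          # e.g. pooled per-frame VLM features
    latent = MotionIntentEncoder()(video_feats)    # "visual motion intent"
    actions = denoise_actions(DiffusionStudentPolicy(), latent)
    print(actions.shape)                           # torch.Size([1, 16, 29])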

Experiments

Quantitative Results:

Tracking Performance

Text-to-Motion Performance
Ablation Study on Different VLMs
Pose-driven vs Latent-driven
Alignment vs Reconstruction

Qualitative Results:

Simulation Performance
MLP Policy vs Diffusion Policy
Real-world Performance
Generated Motion Visualization

BibTeX

@article{li2025robomirror,
    title={RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion},
    author={Zhe Li and Cheng Chi and Boan Zhu and Yangyang Wei and Shuanghao Bai and Yuheng Ji and Yibo Peng and Tao Huang and Pengwei Wang and Zhongyuan Wang and S.-H. Gary Chan and Chang Xu and Shanghang Zhang},
    journal={arXiv preprint arXiv:2510.14952},
    year={2025}
}