From Language To Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance

Zhe Li1*, Cheng Chi2*‡, Yangyang Wei3, Boan Zhu4, Yibo Peng2, Tao Huang5, Pengwei Wang2, Zhongyuan Wang2, Shanghang Zhang2,6†, Chang Xu1
1University of Sydney 2BAAI 3Harbin Institute of Technology 4Hong Kong University of Science and Technology 5Shanghai Jiao Tong University 6Peking University
* Equal Contribution     ‡ Project Lead     † Corresponding Author

Abstract

Natural language offers an intuitive interface for commanding humanoid robots, yet existing language-guided locomotion pipelines remain cumbersome and unreliable. They typically decode human motion from text, retarget it to the robot morphology, and then track it with a physics-based controller. This multi-stage process accumulates errors, introduces high latency, and couples semantics to control only weakly, which calls for a more direct pathway from language to action that eliminates fragile intermediate stages. We therefore present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer–diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework extends naturally to other modalities such as images, audio, and music, providing a general foundation for vision–language–action humanoid systems.
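To make the motion-generation side of the abstract concrete, the sketch below pairs a causal transformer over previously generated latents with a small diffusion head that denoises the next latent. This is a minimal sketch under assumed components: the class name, dimensions, linear noise schedule, and module layout are all illustrative placeholders, not the released RoboGhost architecture.

import torch
import torch.nn as nn

class CausalLatentGenerator(nn.Module):
    """Causal transformer over generated latents plus a diffusion head for the next one."""

    def __init__(self, latent_dim=256, text_dim=512, steps=10):
        super().__init__()
        self.steps = steps
        self.text_proj = nn.Linear(text_dim, latent_dim)  # text embedding -> first context token
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Diffusion head: predicts the noise on the next latent from (noisy latent, context, t).
        self.head = nn.Sequential(
            nn.Linear(latent_dim * 2 + 1, 512), nn.Mish(), nn.Linear(512, latent_dim))
        betas = torch.linspace(1e-4, 2e-2, steps)  # assumed linear schedule
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cum", torch.cumprod(1.0 - betas, dim=0))

    @torch.no_grad()
    def generate(self, text_emb, horizon=8):
        """Autoregressively roll out `horizon` motion latents from a (B, 1, text_dim) embedding."""
        seq = [self.text_proj(text_emb)]  # the text token anchors the causal context
        for _ in range(horizon):
            ctx_in = torch.cat(seq, dim=1)
            mask = nn.Transformer.generate_square_subsequent_mask(ctx_in.shape[1]).to(ctx_in.device)
            ctx = self.backbone(ctx_in, mask=mask)[:, -1]  # causal summary of the history
            z = torch.randn(ctx.shape[0], ctx.shape[1], device=ctx.device)
            for t in reversed(range(self.steps)):  # DDPM-style reverse process
                tb = torch.full((z.shape[0], 1), t / self.steps, device=z.device)
                eps = self.head(torch.cat([z, ctx, tb], dim=-1))
                beta, a_cum = self.betas[t], self.alphas_cum[t]
                z = (z - beta / (1.0 - a_cum).sqrt() * eps) / (1.0 - beta).sqrt()
                if t > 0:
                    z = z + beta.sqrt() * torch.randn_like(z)
            seq.append(z.unsqueeze(1))
        return torch.cat(seq[1:], dim=1)  # (B, horizon, latent_dim)

gen = CausalLatentGenerator()
print(gen.generate(torch.randn(1, 1, 512)).shape)  # torch.Size([1, 8, 256])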

Methodology

RoboGhost is a retargeting-free, latent-driven policy for language-guided humanoid locomotion. By removing the dependency on motion retargeting, it allows robots to be controlled directly via open-ended language commands. The figure showcases (a) the previous pipeline with motion retargeting, (b) our proposed retargeting-free latent-driven pipeline, (c) quantitative comparisons of success rate and time cost between the baseline and RoboGhost, (d) performing a backflip, and (e) dancing and leaping forward.
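The data-flow contrast between pipelines (a) and (b) can be sketched as below. Every component is a hypothetical random-tensor stand-in (encode_text, generate_latent, decode_motion, retarget, track, policy_act are placeholder names, and all shapes are assumed); the snippet illustrates which stages each pipeline traverses, not the trained models.

import torch

# Hypothetical stand-ins for the learned components (assumed shapes).
encode_text = lambda text: torch.randn(1, 512)          # text -> embedding
generate_latent = lambda emb: torch.randn(1, 256)       # embedding -> motion latent
decode_motion = lambda z: torch.randn(1, 60, 22, 3)     # latent -> human joint motion
retarget = lambda m: torch.randn(1, 60, 29)             # human motion -> robot joint targets
track = lambda ref, state: torch.randn(1, 29)           # physics-based tracking controller
policy_act = lambda state, z: torch.randn(1, 29)        # latent-conditioned policy

def baseline_pipeline(text, robot_state):
    """(a) Decode -> retarget -> track: each stage adds latency and error."""
    z = generate_latent(encode_text(text))
    return track(retarget(decode_motion(z)), robot_state)

def roboghost_pipeline(text, robot_state):
    """(b) Retargeting-free: the policy consumes the motion latent directly."""
    z = generate_latent(encode_text(text))
    return policy_act(robot_state, z)

state = torch.randn(1, 93)
print(baseline_pipeline("do a backflip", state).shape)   # torch.Size([1, 29])
print(roboghost_pipeline("do a backflip", state).shape)  # torch.Size([1, 29])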


Overview of RoboGhost. We propose a two-stage approach: a motion latent is first generated, then an MoE-based teacher policy is trained with RL and a diffusion-based student policy is trained to denoise actions conditioned on the motion latent. This latent-driven scheme bypasses the need for motion retargeting.
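A minimal sketch of the student's denoising step follows, assuming a DDPM-style reverse process: starting from Gaussian noise, the policy iteratively denoises an action vector conditioned on proprioception and the motion latent. The action, observation, and latent dimensions, the MLP denoiser, and the linear schedule are all assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn as nn

class LatentConditionedDiffusionPolicy(nn.Module):
    """Student policy: denoises an action conditioned on proprioception and a motion latent."""

    def __init__(self, action_dim=29, obs_dim=93, latent_dim=256, hidden=512, steps=10):
        super().__init__()
        self.action_dim, self.steps = action_dim, steps
        betas = torch.linspace(1e-4, 2e-2, steps)  # assumed linear schedule
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cum", torch.cumprod(1.0 - betas, dim=0))
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + latent_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim))

    def forward(self, noisy_action, obs, latent, t):
        """Predict the noise that was added to the action at diffusion step t."""
        tb = torch.full((obs.shape[0], 1), t / self.steps, device=obs.device)
        return self.net(torch.cat([noisy_action, obs, latent, tb], dim=-1))

    @torch.no_grad()
    def act(self, obs, latent):
        """Sample an executable action by iteratively denoising pure Gaussian noise."""
        a = torch.randn(obs.shape[0], self.action_dim, device=obs.device)
        for t in reversed(range(self.steps)):  # DDPM-style reverse process
            eps = self.forward(a, obs, latent, t)
            beta, a_cum = self.betas[t], self.alphas_cum[t]
            a = (a - beta / (1.0 - a_cum).sqrt() * eps) / (1.0 - beta).sqrt()
            if t > 0:
                a = a + beta.sqrt() * torch.randn_like(a)
        return a

policy = LatentConditionedDiffusionPolicy()
action = policy.act(torch.randn(1, 93), torch.randn(1, 256))
print(action.shape)  # torch.Size([1, 29])

Because the policy consumes the latent directly, no decoded human motion or retargeted reference trajectory ever needs to be produced at deployment time, which is where the latency savings reported in the experiments come from.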

Experiments

Quantitative Results:

Tracking Performance

Text-to-Motion Performance

Qualitative Results:

Simulation Performance
Real-world Performance
Generated Motion Visualization

BibTeX

@article{li2025roboghost,
    title={From Language To Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance},
    author={Zhe Li and Cheng Chi and Yangyang Wei and Boan Zhu and Yibo Peng and Tao Huang and Pengwei Wang and Zhongyuan Wang and Shanghang Zhang and Chang Xu},
    journal={arXiv preprint arXiv:2510.14952},
    year={2025}
}