Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

Zhe Li1*, Cheng Chi2*‡, Yangyang Wei3*, Boan Zhu4, Tao Huang5, Zhenguo Sun2, Yibo Peng2, Pengwei Wang2, Zhongyuan Wang2, Fangzhou Liu3, Chang Xu1, Shanghang Zhang2,6†
1University of Sydney 2BAAI 3Harbin Institute of Technology 4Hong Kong University of Science and Technology 5Shanghai Jiao Tong University 6Peking University
* Equal Contribution     ‡ Project Lead     † Corresponding Author

Abstract

Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities and remain confined to predefined motions or sparse commands. Existing pipelines that generate motion from audio and then retarget it to robots rely on explicit motion reconstruction, leading to cascaded errors, high latency, and a disjointed acoustic-to-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that directly generates music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy that adapts to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising physical plausibility and audio alignment, transforming robots into responsive performers capable of reacting to audio.

Methodology

RoboPerform makes a humanoid perform as both a dancer and a talker: it uses audio as the control signal for humanoid locomotion, enabling the policy to generate rhythm-aligned co-speech gestures and dance movements from input speech or music.


Overview of RoboPerform. We propose a two-stage approach: first, an adaptor is trained to inject kinematic information into the audio modality; then a ∆MoE teacher policy is trained with RL, and a diffusion-based student policy is trained to denoise actions conditioned on the audio latent. We propose that motion = content + style: we therefore fix the motion latent as a constant content condition and use different audio signals as style modulation signals to generate actions that adapt to diverse rhythms.
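To make the two components above concrete, below is a minimal sketch (not the released implementation) of (1) a residual mixture-of-experts (∆MoE) head in which gated experts add corrections on top of a shared base policy, and (2) a diffusion-style student that denoises action chunks conditioned on a fixed motion (content) latent and an audio (style) latent. All module names, dimensions, the FiLM-style conditioning, and the simplified DDPM-like noise schedule are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class DeltaMoE(nn.Module):
        """Base policy plus a gated sum of residual experts (∆MoE-style teacher head)."""

        def __init__(self, obs_dim, act_dim, n_experts=4, hidden=256):
            super().__init__()
            self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                      nn.Linear(hidden, act_dim))
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                              nn.Linear(hidden, act_dim))
                for _ in range(n_experts))
            self.gate = nn.Linear(obs_dim, n_experts)

        def forward(self, obs):
            w = F.softmax(self.gate(obs), dim=-1)                         # (B, E) expert weights
            deltas = torch.stack([e(obs) for e in self.experts], dim=1)   # (B, E, A) residuals
            return self.base(obs) + (w.unsqueeze(-1) * deltas).sum(dim=1)


    class AudioStyleDenoiser(nn.Module):
        """Student: predict the noise added to an action chunk, conditioned on a
        constant content latent and an audio-derived style latent (FiLM-style)."""

        def __init__(self, act_dim, horizon, content_dim, audio_dim, hidden=512):
            super().__init__()
            self.embed = nn.Linear(act_dim * horizon + content_dim + 1, hidden)
            self.film = nn.Linear(audio_dim, 2 * hidden)    # audio style -> (scale, shift)
            self.out = nn.Sequential(nn.ELU(), nn.Linear(hidden, act_dim * horizon))

        def forward(self, noisy_actions, t, content_z, audio_z):
            x = torch.cat([noisy_actions.flatten(1), content_z, t], dim=-1)
            h = self.embed(x)
            scale, shift = self.film(audio_z).chunk(2, dim=-1)   # audio as style modulation
            return self.out(h * (1 + scale) + shift)


    def diffusion_loss(model, actions, content_z, audio_z, n_steps=50):
        """Denoising objective on teacher action chunks (simplified DDPM-like schedule)."""
        B = actions.shape[0]
        t = torch.randint(1, n_steps + 1, (B, 1), device=actions.device).float() / n_steps
        noise = torch.randn_like(actions)
        noisy = torch.sqrt(1 - t).unsqueeze(-1) * actions + torch.sqrt(t).unsqueeze(-1) * noise
        pred = model(noisy, t, content_z, audio_z)
        return F.mse_loss(pred, noise.flatten(1))


    if __name__ == "__main__":
        # Illustrative shapes only: 93-d observations, 29 joints, 8-step action chunks.
        teacher = DeltaMoE(obs_dim=93, act_dim=29)
        student = AudioStyleDenoiser(act_dim=29, horizon=8, content_dim=64, audio_dim=128)
        obs = torch.randn(4, 93)
        teacher_actions = teacher(obs).unsqueeze(1).repeat(1, 8, 1)   # stand-in action chunk
        loss = diffusion_loss(student, teacher_actions,
                              content_z=torch.randn(4, 64), audio_z=torch.randn(4, 128))
        print(float(loss))

Under this sketch, actions produced by the ∆MoE teacher would supply the regression targets for diffusion_loss, while at deployment only the student runs, denoising action chunks conditioned on a constant content latent and streaming audio features; this is the sense in which the design avoids explicit motion reconstruction.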

Experiments

Quantitative Results:

Tracking Performance

Motion-Audio Alignment
Ablation Study on Adaptor
Ablation Study on ∆MoE
Ablation Study on Style Injection

Qualitative Results:

Simulation Performance
MLP Policy vs Diffusion Policy
t-SNE clustering of each expert in ∆MoE
Tracking Performance
Real-world Performance

BibTeX

@article{li2025roboperform,
    title={Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control},
    author={Zhe Li and Cheng Chi and Yangyang Wei and Boan Zhu and Tao Huang and Zhenguo Sun and Yibo Peng and Pengwei Wang and Zhongyuan Wang and Fangzhou Liu and Chang Xu and Shanghang Zhang},
    journal={arXiv preprint arXiv:2510.14952},
    year={2025}
}