Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, remaining confined to predefined motions or sparse commands. Existing pipelines that first generate motion from audio and then retarget it to robots rely on explicit motion reconstruction, leading to cascaded errors, high latency, and a disjointed mapping from acoustics to actuation. We propose RoboPerform, the first unified audio-to-locomotion framework that directly generates music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy that adapts to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, transforming robots into responsive performers that react to audio.
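To make the teacher design more concrete, below is a minimal PyTorch sketch of a residual mixture-of-experts policy head: a shared base network plus gated expert residuals, which is one way such a policy could adapt to diverse motion patterns. The class, dimension, and layer names (ResMoEPolicy, obs_dim, act_dim, num_experts) are illustrative assumptions, not the paper's released implementation.

# Hedged sketch: a residual MoE policy head (base action + gated expert residuals).
import torch
import torch.nn as nn

class ResMoEPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, num_experts: int = 4, hidden: int = 256):
        super().__init__()
        # Shared base network produces a default action for the observation.
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )
        # Each expert predicts a residual correction for a particular motion pattern.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
            for _ in range(num_experts)
        )
        # Gating network mixes the expert residuals per observation.
        self.gate = nn.Linear(obs_dim, num_experts)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        base_action = self.base(obs)                                # (B, act_dim)
        weights = torch.softmax(self.gate(obs), dim=-1)             # (B, num_experts)
        residuals = torch.stack([e(obs) for e in self.experts], 1)  # (B, num_experts, act_dim)
        # Final action = base action + gated sum of expert residuals.
        return base_action + (weights.unsqueeze(-1) * residuals).sum(dim=1)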
RoboPerform makes humanoids perform as dancers and talkers: it uses audio as the signal to control humanoid locomotion, enabling the policy to generate rhythm-aligned co-speech gestures and dance movements from input speech or music.
Overview of RoboPerform. We propose a two-stage approach: first, an adaptor is trained to inject kinematic information into the audio modality; then a ∆MoE teacher policy is trained with RL, and a diffusion-based student policy is trained to denoise actions conditioned on the audio latent. We posit that motion = content + style. Thus, we fix the motion latent as a constant condition and use different audio signals as style modulation signals to generate actions that adapt to diverse rhythms.
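As a concrete illustration of the "motion = content + style" conditioning, the sketch below keeps the motion (content) latent fixed while the audio latent modulates the denoising network as a style signal via FiLM-style scale and shift. The module names, dimensions, and the FiLM choice are assumptions for exposition, not the released code.

# Hedged sketch: diffusion-style denoiser with a constant motion latent (content)
# and an audio latent acting as style modulation.
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, act_dim: int, motion_dim: int, audio_dim: int, hidden: int = 256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        # Content condition: a fixed motion latent shared across audio inputs.
        self.motion_proj = nn.Linear(motion_dim, hidden)
        # Style condition: FiLM scale/shift predicted from the audio latent.
        self.audio_film = nn.Linear(audio_dim, 2 * hidden)
        self.net = nn.Sequential(
            nn.Linear(act_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, t, motion_latent, audio_latent):
        cond = self.time_embed(t) + self.motion_proj(motion_latent)   # (B, hidden)
        scale, shift = self.audio_film(audio_latent).chunk(2, dim=-1)
        cond = (1 + scale) * cond + shift                             # audio as style modulation
        return self.net(torch.cat([noisy_action, cond], dim=-1))      # predicted noise

# Usage: the motion latent stays constant; swapping the audio latent changes the style.
model = AudioConditionedDenoiser(act_dim=29, motion_dim=64, audio_dim=128)  # 29 = example joint count
noisy = torch.randn(2, 29)
t = torch.rand(2, 1)
motion_latent = torch.zeros(2, 64)   # fixed content condition
audio_latent = torch.randn(2, 128)   # music or speech embedding (style)
eps_pred = model(noisy, t, motion_latent, audio_latent)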
@article{li2025roboperform,
title={Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control},
author={Zhe Li and Chengchi and Yangyang Wei and Boan Zhu and Tao Huang and Zhenguo Sun and Yibo Peng and Pengwei Wang and Zhongyuan Wang and Fangzhou Liu and Chang Xu and Shanghang Zhang},
journal={arXiv preprint arXiv:2510.14952},
year={2025}
}