HumanoidUMI

A Robot-Free Bridge to Humanoid Whole-Body Skills

HumanoidUMI provides a robot-free data collection and learning framework for humanoid whole-body skills. Human demonstrations are collected with portable VR-UMI devices and represented as sparse task-relevant spatial keypoints. A high-level policy predicts future keypoint motions and gripper actions, Spatial Keypoint Retargeting converts them into feasible humanoid whole-body references, and a low-level whole-body controller executes the motion with balance and stability.

Overview of the full HumanoidUMI pipeline, from robot-free human demonstration capture to deployable humanoid whole-body execution.

Abstract

High-quality data collection is a cornerstone of training humanoid whole-body visuomotor policies. Current data acquisition paradigms rely predominantly on robot teleoperation, which is often hindered by limited hardware accessibility and low operational efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose HumanoidUMI, a portable, efficient, and robot-free data collection framework tailored for humanoid robots. HumanoidUMI uses lightweight VR devices to capture human demonstrations as sparse keypoint trajectories while simultaneously recording images from wrist-mounted cameras. These multimodal data are then used to train a high-level policy network that predicts future keypoint trajectories conditioned on the captured visual features. A keypoint retargeting pipeline maps the predicted keypoint trajectories onto the robot's morphology, and a whole-body controller executes them. This approach enables the transfer of diverse and agile behaviors from natural human demonstrations to humanoid embodiments. We demonstrate the efficacy and versatility of the proposed framework across five real-world scenarios.

Hardware: Robot-Free Data Acquisition

The data acquisition platform combines a Pico 4-based motion capture setup — two foot-mounted trackers and one waist-mounted tracker — with two instrumented grippers, each carrying a fisheye camera.

The system synchronously records three multimodal streams:

Wrist-view images from the fisheye cameras on each gripper.
Whole-body keypoint states obtained via the Pico SDK and aligned to the robot frame.
Gripper aperture measured directly from the motor's magnetic encoder.

These heterogeneous streams are used jointly to train the high-level policy, and the same modalities drive the robot in real time at deployment. Data collection itself never requires access to the physical humanoid.

Method Overview

HumanoidUMI formulates humanoid visuomotor control as a three-stage hierarchy:

High-level policy: A diffusion-based policy infers task-space keypoint trajectories and gripper commands from wrist-view images and partial proprioception.
Spatial Keypoint Retargeting (SKR): Maps these commands to a robot-native 36-dimensional motion representation — root position, root orientation, and 29 joint angles.
Low-level controller: A whole-body controller tracks the retargeted motion using proprioceptive feedback, enabling stable humanoid execution from robot-free demonstrations.
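
As a rough illustration of the 36-dimensional representation, the sketch below packs and unpacks one motion frame. The 3 + 4 + 29 split assumes the root orientation is stored as a 4-element quaternion, which is inferred from the stated total rather than confirmed by the text. The field order, the (w, x, y, z) convention, and the name `MotionFrame` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

NUM_JOINTS = 29
# Assumed layout: 3 (root position) + 4 (root quaternion) + 29 (joint angles).
MOTION_DIM = 3 + 4 + NUM_JOINTS


@dataclass
class MotionFrame:
    """One frame of the robot-native motion representation (hypothetical layout)."""

    root_position: List[float]    # x, y, z in metres
    root_quaternion: List[float]  # assumed (w, x, y, z) order
    joint_angles: List[float]     # 29 joint angles in radians

    def to_vector(self) -> List[float]:
        """Flatten the frame into a single 36-element list."""
        return [*self.root_position, *self.root_quaternion, *self.joint_angles]

    @classmethod
    def from_vector(cls, vec: Sequence[float]) -> "MotionFrame":
        """Split a flat 36-element vector back into its three fields."""
        if len(vec) != MOTION_DIM:
            raise ValueError(f"expected {MOTION_DIM} values, got {len(vec)}")
        return cls(list(vec[:3]), list(vec[3:7]), list(vec[7:]))


# Example: a standing frame with identity orientation and zero joint angles.
frame = MotionFrame([0.0, 0.0, 0.75], [1.0, 0.0, 0.0, 0.0], [0.0] * NUM_JOINTS)
assert len(frame.to_vector()) == MOTION_DIM
```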

High-Level Diffusion Policy

We instantiate the high-level policy as a whole-body extension of Diffusion Policy that operates on a sparse task space rather than the full joint configuration.

Observation: Left/right wrist-view RGB images are encoded by DINOv2 and fused with a 15-dimensional lower-body proprioceptive state and the diffusion step into a single global condition.
Action (47-D): Five keypoints — pelvis, left/right TCP, and left/right foot — each represented as a 3-D translation and a 6-D continuous rotation, plus two scalar gripper widths.
Temporal structure: At each step, the policy predicts a receding-horizon chunk of length H = 48. Supervision targets are encoded in each keypoint's own local frame, removing dependence on the world frame at recording time and improving cross-episode generalization.
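
The 47-D action size follows from 5 × (3 + 6) + 2. The sketch below splits one action vector into per-keypoint poses and decodes each 6-D rotation into a rotation matrix with Gram-Schmidt orthonormalization, the usual construction for this continuous representation. The keypoint order, the position of the gripper widths at the end of the vector, and the reading of the six numbers as the first two matrix columns are assumptions made for illustration.

```python
import math

# Assumed ordering of the five keypoints inside the action vector.
KEYPOINTS = ("pelvis", "left_tcp", "right_tcp", "left_foot", "right_foot")
POSE_DIM = 3 + 6                             # 3-D translation + 6-D rotation
ACTION_DIM = len(KEYPOINTS) * POSE_DIM + 2   # plus two gripper widths = 47


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-9:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rot6d_to_matrix(r6):
    """Gram-Schmidt: turn two 3-vectors into an orthonormal rotation matrix."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = _normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns; return the matrix in row-major form.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def split_action(action):
    """Split a 47-D action into keypoint poses and gripper widths."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    out = {}
    for k, name in enumerate(KEYPOINTS):
        chunk = action[k * POSE_DIM:(k + 1) * POSE_DIM]
        out[name] = {"translation": list(chunk[:3]),
                     "rotation": rot6d_to_matrix(chunk[3:])}
    out["gripper_widths"] = list(action[-2:])
    return out
```

Because the decoder re-orthonormalizes its input, a slightly off-manifold network output still yields a valid rotation, which is one reason this representation is popular for regression targets.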

At inference, the denoised chunk, which is still normalized and expressed in local frames, is converted back to absolute SE(3) targets and forwarded to SKR for whole-body retargeting.
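
A minimal sketch of the local-frame encoding and its inverse, assuming each target is expressed relative to a reference pose of the same keypoint (for example, its pose at the start of the chunk). The choice of reference pose is an assumption, and the normalization step is omitted. Poses are `(R, t)` pairs with `R` a row-major 3×3 rotation matrix; all function names are hypothetical.

```python
import math


def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose3(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def to_local(ref, target):
    """Encode target in the frame of ref: T_rel = inverse(T_ref) * T_target."""
    R_ref, t_ref = ref
    R_tgt, t_tgt = target
    R_inv = transpose3(R_ref)  # the inverse of a rotation is its transpose
    diff = [a - b for a, b in zip(t_tgt, t_ref)]
    return matmul3(R_inv, R_tgt), matvec3(R_inv, diff)


def to_absolute(ref, rel):
    """Decode a local-frame pose back to the world: T_abs = T_ref * T_rel."""
    R_ref, t_ref = ref
    R_rel, t_rel = rel
    t_abs = [a + b for a, b in zip(matvec3(R_ref, t_rel), t_ref)]
    return matmul3(R_ref, R_rel), t_abs


def rot_z(theta):
    """Rotation about the z-axis, used only to build example poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Round trip: encode a world-frame target locally, then recover it.
reference = (rot_z(math.pi / 2), [1.0, 2.0, 0.0])
world_target = (rot_z(math.pi), [1.0, 3.0, 0.5])
local_target = to_local(reference, world_target)
recovered = to_absolute(reference, local_target)
```

The local target depends only on the relative motion, so the same demonstration recorded with a different world origin would produce the same supervision signal.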

SKR: Preserving Metric Spatial Structure Across Embodiments

In robot-free settings, conventional retargeting methods (e.g., GMR) apply global or local rescaling to bridge the human–robot embodiment gap, which often distorts metric spatial information in the demonstration.

Spatial Keypoint Retargeting (SKR) takes a different route: it represents motion using five task-relevant keypoints (pelvis, left/right TCP, and left/right foot) and scales only the vertical pelvis-to-foot distance to compensate for height differences.

All other metric spatial relationships are preserved, allowing the retargeted motion to maintain the geometric structure of the original demonstration.
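
One possible reading of this scaling rule is sketched below. It assumes a z-up world frame, takes the mean foot height as the ground reference, and shifts the TCPs by the same vertical offset as the pelvis so that pelvis-to-hand geometry stays metric. The `height_ratio` parameter (robot leg length divided by human leg length) and the handling of the TCPs are assumptions, not details given in the text.

```python
def scale_pelvis_height(keypoints, height_ratio):
    """Scale only the vertical pelvis-to-foot distance (illustrative sketch).

    keypoints: dict mapping keypoint name to an [x, y, z] position in metres.
    height_ratio: hypothetical robot-to-human leg length ratio.
    """
    left_foot = keypoints["left_foot"]
    right_foot = keypoints["right_foot"]
    foot_z = 0.5 * (left_foot[2] + right_foot[2])

    pelvis_z = keypoints["pelvis"][2]
    new_pelvis_z = foot_z + height_ratio * (pelvis_z - foot_z)
    dz = new_pelvis_z - pelvis_z

    # Copy so the input demonstration is left untouched.
    out = {name: list(pos) for name, pos in keypoints.items()}
    # Assumption: upper-body keypoints move rigidly with the pelvis in z.
    for name in ("pelvis", "left_tcp", "right_tcp"):
        out[name][2] += dz
    return out


human = {
    "pelvis": [0.0, 0.0, 1.0],
    "left_tcp": [0.4, 0.2, 1.2],
    "right_tcp": [0.4, -0.2, 1.2],
    "left_foot": [0.0, 0.1, 0.0],
    "right_foot": [0.0, -0.1, 0.0],
}
robot = scale_pelvis_height(human, height_ratio=0.75)
```

Horizontal coordinates and foot positions pass through unchanged, which is the sense in which the demonstration's metric structure is kept.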

SKR also serves as the closed-loop interface between the high-level policy and the low-level controller:

State estimation: Joint states are read from the Unitree SDK, and forward kinematics is performed with the pelvis as the reference frame to compute current keypoint poses.
Target construction: The current state is combined with high-level predictions to form desired keypoint targets.
Inverse kinematics: The system solves a whole-body IK problem using mink (a MuJoCo-based solver) to produce joint-level commands.
Control loop: This process runs at every control step, so state estimation, target construction, and IK form a continuous closed loop.
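
The estimate-target-solve cycle can be illustrated on a toy problem. The sketch below is not the mink API and not the whole-body solver described above: it substitutes a damped least-squares IK step on a planar two-link arm, with hypothetical link lengths, gain, and damping, to show how re-solving IK from the measured state at every step closes the loop.

```python
import math

# Hypothetical link lengths in metres for the toy arm.
LINK_1, LINK_2 = 0.30, 0.25


def fk(q):
    """Forward kinematics: joint angles to end-effector position."""
    x = LINK_1 * math.cos(q[0]) + LINK_2 * math.cos(q[0] + q[1])
    y = LINK_1 * math.sin(q[0]) + LINK_2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12],
            [LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12]]


def dls_step(q, target, damping=1e-2, gain=0.5):
    """One damped least-squares update: dq = (J^T J + d^2 I)^-1 J^T e."""
    x, y = fk(q)                      # state estimation from current joints
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    a11 = J[0][0] ** 2 + J[1][0] ** 2 + damping ** 2
    a12 = J[0][0] * J[0][1] + J[1][0] * J[1][1]
    a22 = J[0][1] ** 2 + J[1][1] ** 2 + damping ** 2
    b1 = J[0][0] * ex + J[1][0] * ey
    b2 = J[0][1] * ex + J[1][1] * ey
    det = a11 * a22 - a12 * a12       # positive because damping > 0
    dq0 = (a22 * b1 - a12 * b2) / det
    dq1 = (a11 * b2 - a12 * b1) / det
    return [q[0] + gain * dq0, q[1] + gain * dq1]


def run_closed_loop(q, target, steps=200):
    """Re-solve IK from the latest state at every control step."""
    for _ in range(steps):
        q = dls_step(q, target)
    return q


q_final = run_closed_loop([0.3, 0.6], (0.35, 0.25))
```

The damping term keeps the update bounded near singular configurations, at the cost of slightly slower convergence.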

Experiments

We evaluate HumanoidUMI on real-world humanoid manipulation tasks using a Unitree G1 robot. The experiments test three questions: whether robot-free VR-UMI demonstrations can yield deployable policies, whether sparse keypoints can coordinate hands, feet, and waist, and whether robot-free collection improves data throughput over teleoperation.

Real-world evaluation and ablation analysis on three humanoid manipulation tasks

Robot-free demonstrations transfer to deployable manipulation skills. HumanoidUMI is evaluated on cluttered tabletop pick-and-place, bimanual vegetable collection, and dynamic ball-shooting. These tasks cover single-arm visual grasping, coordinated bimanual manipulation, and precisely timed dynamic release.

The full pipeline consistently outperforms ablated variants. Replacing Spatial Keypoint Retargeting with GMR reduces success on the manipulation tasks, while removing latency matching hurts dynamic shooting, showing that spatially grounded retargeting and synchronized sensing-control timing are both critical.

Real-world whole-body evaluation and knee-keypoint ablation

Sparse keypoints support whole-body coordination. Under-table waste disposal requires grasping, stepping backward, knee and torso bending, and reaching into a constrained workspace. Walking coffee delivery requires forward locomotion, stable stopping, and handover.

The knee-keypoint ablation shows that lower-body keypoints are important when waist-leg coordination is substantial. Without them, lower-body posture is underconstrained, reducing the reliability of body lowering, stepping, and stable reaching.

BibTeX

@misc{wang2026humanoidumibridgingrobotfreedemonstrations,
      title={HumanoidUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation},
      author={Hongwu Wang and Chenhao Yu and Youhao Hu and Jiachen Zhang and Yuanyuan Li and Shaqi Luo},
      year={2026},
      eprint={2606.27239},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.27239},
}

Join Us

We are looking for passionate and self-motivated students to join us!
Our research focuses on Vision-Language-Action (VLA) and Whole-Body Control (WBC).
Internship and research collaboration opportunities are available.

📬 Send your resume to: wowhongwu@163.com, luoshaqi@163.com