OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

Without direct and physically grounded interaction feedback, teleoperation users rely mainly on visual cues and tend to overcompensate at contact, leading to inefficient demonstrations and severe force fluctuations. By contrast, OmniUMI enables more natural contact modulation and produces smoother, more human-like interaction data, supporting the claim that the handheld embodiment and interaction-centered sensing design improve human alignment during demonstration.

Abstract

UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and end-effector trajectories while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics, such as tactile interaction, internal grasping force, and external interaction wrench, that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection--deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual force feedback: bilateral gripper force feedback and natural perception of the external interaction wrench through the handheld embodiment. Built on this interface, we extend diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.
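
As a rough illustration of the policy input described above (not the authors' implementation), the sketch below shows how visual, tactile, and force-related observations might be concatenated into a single conditioning vector for a diffusion policy; all dimensions and names are assumptions.

```python
# Minimal sketch (not the authors' code) of assembling the multimodal
# conditioning vector for a diffusion policy. All dimensions and names
# below are illustrative assumptions.
import numpy as np

def build_conditioning(rgb_feat: np.ndarray,    # (512,) visual embedding
                       depth_feat: np.ndarray,  # (128,) depth embedding
                       tactile: np.ndarray,     # (64,) flattened taxel readings
                       grasp_force: float,      # internal grasping force [N]
                       wrench: np.ndarray       # (6,) external wrench [N, N*m]
                       ) -> np.ndarray:
    """Concatenate every modality into one observation vector that
    conditions the diffusion policy's denoising network."""
    force_terms = np.concatenate([[grasp_force], wrench])
    return np.concatenate([rgb_feat, depth_feat, tactile, force_terms])

# Dummy example:
obs = build_conditioning(np.zeros(512), np.zeros(128), np.zeros(64),
                         5.0, np.zeros(6))
assert obs.shape == (512 + 128 + 64 + 7,)
```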

OmniUMI System Design

Collection--Deployment Consistency via a Motorized Gripper.

A major challenge in UMI-style systems lies in collection--deployment consistency. Even in vision-centric settings, aligning the interface used for demonstration with the hardware used for deployment is already nontrivial. Once internal grasping force, external interaction wrench, tactile sensing, and depth are introduced, the mismatch becomes substantially more severe, because the observations no longer depend only on RGB appearance and gripper width but also on contact mechanics and the sensing embodiment.

To reduce this gap, OmniUMI adopts a motorized quick-swap gripper that can be reused across both collection and deployment. Reusing the same gripper preserves not only the geometry of the end-effector, but also the interaction-related sensing context attached to it. This significantly reduces alignment effort and makes multimodal observations more consistent across the two stages. In this sense, the motorized gripper is not merely a hardware convenience; it is a system design choice aimed at preserving physically grounded multimodal consistency.
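
One way to make this concrete is a single gripper-side observation schema shared by both stages. The sketch below is a hedged illustration with hypothetical field names, not the released code: because the same motorized gripper and the same schema serve collection and deployment, an embodiment mismatch surfaces as a validation failure rather than a silent distribution shift.

```python
# Minimal sketch of a shared gripper-side observation schema; all field
# names and shapes are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class GripperObservation:
    timestamp: float        # seconds on a shared clock
    width: float            # gripper opening [m]
    grasp_force: float      # internal grasping force [N]
    tactile: np.ndarray     # (H, W) taxel pressure map
    wrench: np.ndarray      # (6,) external interaction wrench

def validate(obs: GripperObservation) -> None:
    """Run the identical checks during collection and deployment."""
    assert obs.tactile.ndim == 2, "tactile map must be 2-D"
    assert obs.wrench.shape == (6,), "wrench must be a 6-vector"
    assert obs.width >= 0.0, "gripper width must be non-negative"
```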

Motorized Quick-Swap Structure

Exploded System View

Exploded view of the OmniUMI system design.

Hardware overview of OmniUMI. The system is organized around a unified handheld embodiment that integrates a custom hub, fisheye and depth sensing, 6-axis force sensing, motion tracking, a quick-swap mechanism, and a tactile-compatible gripper structure. The same gripper-side embodiment is reused across data collection and deployment to improve collection--deployment consistency.
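
Synchronous capture across streams running at different rates typically reduces to timestamp alignment. The sketch below is an assumption about such a pipeline, not the released code: it pairs each camera frame with the nearest sample of a faster force stream, within a tolerance.

```python
# Minimal sketch of timestamp alignment between a 30 Hz camera stream
# and a faster F/T stream; rates and tolerance are illustrative.
import numpy as np

def align_to_frames(frame_ts: np.ndarray, sensor_ts: np.ndarray,
                    tol: float = 0.02) -> np.ndarray:
    """For each frame timestamp, return the index of the nearest sensor
    sample, or -1 if none lies within `tol` seconds."""
    idx = np.searchsorted(sensor_ts, frame_ts)
    idx = np.clip(idx, 1, len(sensor_ts) - 1)
    left, right = sensor_ts[idx - 1], sensor_ts[idx]
    nearest = np.where(frame_ts - left <= right - frame_ts, idx - 1, idx)
    valid = np.abs(sensor_ts[nearest] - frame_ts) <= tol
    return np.where(valid, nearest, -1)

frames = np.arange(0.0, 1.0, 1 / 30)    # 30 Hz camera timestamps
force = np.arange(0.0, 1.0, 1 / 500)    # 500 Hz force timestamps
print(align_to_frames(frames, force)[:5])   # nearest force index per frame
```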

Tasks

Task 1: Wipe Whiteboard 🧽

We design a whiteboard erasing task in which the end-effector must remove red drawings from a flat whiteboard surface while maintaining continuous sliding contact. Compared with simple surface tracing, this task imposes a more realistic and interaction-intensive requirement, since successful erasing depends not only on trajectory tracking but also on maintaining appropriate external contact force throughout the motion.
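
To make the force requirement concrete, here is a minimal 1-DoF admittance sketch of the kind of normal-force regulation this task demands; the target force and gain are assumptions, not the paper's parameters.

```python
# Minimal 1-DoF admittance sketch for force-regulated wiping; the target
# force and gain below are illustrative assumptions.
def admittance_step(z_cmd: float, f_measured: float,
                    f_target: float = 4.0,   # desired normal force [N]
                    gain: float = 1e-4) -> float:
    """Lower the commanded height when contact force is below target,
    raise it when the force is too high (the surface is below the tool)."""
    return z_cmd - gain * (f_target - f_measured)

z = 0.10                            # initial commanded height [m]
for f in [0.0, 2.0, 4.5, 4.0]:      # dummy force readings [N]
    z = admittance_step(z, f)
    print(f"commanded z = {z:.5f} m")
```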

Task 2: Drop Cup 🥛

The robot initially grasps two transparent plastic cups placed upside down and nested together. During execution, the policy is required to gradually loosen the gripper until the inner cup is released and drops, while the outer cup remains stably grasped. This task is intentionally designed to operate under very small gripping forces, making it a suitable benchmark for evaluating whether tactile sensing can support subtle contact-state discrimination and precise release control beyond what vision alone can provide.
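
As a hedged illustration of the tactile cue involved (thresholds are assumptions, not measured values), the sketch below flags the moment the inner cup slips free: total taxel pressure drops sharply, yet residual contact from the outer cup remains.

```python
# Minimal sketch of tactile-informed release detection; `drop_ratio`
# and the residual-contact floor are illustrative assumptions.
import numpy as np

def inner_cup_released(tactile_ref: np.ndarray, tactile_now: np.ndarray,
                       drop_ratio: float = 0.4) -> bool:
    """True when total pressure falls by `drop_ratio` relative to the
    two-cup grasp, while some contact (the outer cup) persists."""
    ref, now = tactile_ref.sum(), tactile_now.sum()
    return now < (1.0 - drop_ratio) * ref and now > 0.05 * ref

before = np.full((8, 8), 1.0)    # dummy taxel map: both cups in hand
after = np.full((8, 8), 0.4)     # pressure drop once the inner cup falls
print(inner_cup_released(before, after))   # True
```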

Task 3: Pick-Place Water 🫙

We perform a pick-and-place task using a relatively heavy object, namely a full unopened bottle of mineral water. This task requires sufficiently large and well-regulated internal grasping force to securely lift, transport, and place the object without slippage, making it an appropriate benchmark for load-sensitive grasp control.
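
For intuition, the following sketch raises the grip-force command whenever the tangential load at the fingertips approaches the friction-cone boundary, the classic incipient-slip criterion; the friction coefficient, margin, and step size are assumptions rather than the paper's values.

```python
# Minimal sketch of load-sensitive grasp regulation via the friction
# cone; mu, margin, and step size are illustrative assumptions.
def regulate_grip(f_normal: float, f_tangential: float, f_cmd: float,
                  mu: float = 0.6, margin: float = 0.8,
                  step: float = 1.0) -> float:
    """Squeeze harder when the tangential load nears mu * f_normal."""
    if f_tangential > margin * mu * max(f_normal, 1e-6):
        return f_cmd + step
    return f_cmd

# A full water bottle loads the fingertips tangentially during lifting:
cmd = regulate_grip(f_normal=10.0, f_tangential=5.5, f_cmd=10.0)
print(cmd)   # 11.0 -- near the friction limit, so the grip tightens
```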

BibTeX

@misc{luo2026omniumiphysicallygroundedrobot,
      title={OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction}, 
      author={Shaqi Luo and Yuanyuan Li and Youhao Hu and Chenhao Yu and Chaoran Xu and Jiachen Zhang and Guocai Yao and Tiejun Huang and Ran He and Zhongyuan Wang},
      year={2026},
      eprint={2604.10647},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.10647}, 
}

Join Us

We are looking for passionate and self-motivated students to join us!
Our research focuses on Vision-Language-Action (VLA) and Whole-Body Control (WBC).
Internship and research collaboration opportunities are available.

📬 Send your resume to: luoshaqi@163.com