A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Coupling what is seen, what is done, and what is felt during real human interaction with articulated objects.

ETH Zurich · TU Munich · University of Freiburg · Microsoft · University of Bonn
Hoi! dataset teaser
3048 Sequences · 381 Articulated Objects · 38 Environments · 4 Embodiments

Abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments — (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper — where the tool embodiments provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers not only to evaluate how well methods transfer between human and robotic viewpoints, but also to investigate underexplored modalities such as force sensing and prediction.

The Dataset

4 Manipulation Schemes

Hoi! gripper, human hand, hand with wrist-camera, and UMI gripper — enabling cross-embodiment research.

Multi-View Capture

Egocentric (Aria glasses), manipulation-centric (wrist/gripper), and exocentric (iPhone RGB-D) viewpoints.

Force & Tactile Sensing

Synchronized force/torque and DIGIT tactile sensing through the custom Hoi! gripper.

Spatial Alignment

All recordings registered to a common frame via Leica laser scans, with articulated and static states.

381 Articulated Objects

Drawers, cabinets, dishwashers, fridges — across 38 real-world kitchens, bathrooms, and bedrooms.

Temporally Aligned

Nanosecond-resolution timestamps across all recording modules within each session.
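As an illustration of what the nanosecond-resolution timestamps make possible, the sketch below matches each video frame to its closest force/torque sample by timestamp. This is a minimal example under assumed inputs (sorted timestamp arrays), not the dataset's actual API; all names here are hypothetical.

```python
import numpy as np

def nearest_timestamp_align(frame_ts_ns, force_ts_ns):
    """Match each video frame to the nearest force/torque sample.

    frame_ts_ns, force_ts_ns: sorted 1-D arrays of nanosecond timestamps
    (hypothetical names; the released file layout may differ).
    Returns, for every frame, an index into force_ts_ns.
    """
    # Position of each frame timestamp within the force timeline.
    idx = np.searchsorted(force_ts_ns, frame_ts_ns)
    idx = np.clip(idx, 1, len(force_ts_ns) - 1)
    # Pick whichever neighbor (left or right) is closer in time.
    left = force_ts_ns[idx - 1]
    right = force_ts_ns[idx]
    idx -= frame_ts_ns - left < right - frame_ts_ns
    return idx
```

The same nearest-neighbor lookup applies to any pair of modalities (e.g. tactile frames to egocentric video), since all recording modules share one session clock.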

Paper Figures

Key figures from the paper. See the full paper for all figures and details.

Recording pipeline
Recording Pipeline. The four manipulation schemes and recording modules used for capture.
Dataset statistics
Dataset Statistics. Distribution of environments and articulated interaction categories in the Hoi! dataset.
Force profiles
Force Profiles. Force/torque measurements during articulated object manipulation.

Explore Samples

RGB image sequences from each recording modality and a 3D Leica point cloud. Select a scene to explore.

Views: Human Hand · Hoi! Gripper · Exocentric (iPhone) · UMI Gripper

Leica Point Cloud — Bedroom


Dataset Documentation


Download

Full dataset and code coming soon. Example data available now.

Team

Citation

@article{engelbracht2025hoi,
  title={Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation},
  author={Engelbracht, Tim and Zurbrügg, René and Wohlrapp, Matteo and Büchner, Martin and Valada, Abhinav and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
  journal={arXiv preprint arXiv:2512.04884},
  year={2025}
}