A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Coupling what is seen, what is done, and what is felt during real human interaction with articulated objects.

ETH Zurich · TU Munich · University of Freiburg · Microsoft · University of Bonn
Hoi! dataset teaser
3048 Sequences · 381 Articulated Objects · 38 Environments · 4 Embodiments

Abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments — (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper — where the tool embodiments provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers not only to evaluate how well methods transfer between human and robotic viewpoints, but also to investigate underexplored modalities such as force sensing and prediction.

The Dataset

4 Manipulation Schemes

Hoi! gripper, human hand, hand with wrist-camera, and UMI gripper — enabling cross-embodiment research.

Multi-View Capture

Egocentric (Aria glasses), manipulation-centric (wrist/gripper), and exocentric (iPhone RGB-D) viewpoints.

Force & Tactile Sensing

Synchronized force/torque and DIGIT tactile sensing through the custom Hoi! gripper.

Spatial Alignment

All recordings registered to a common frame via Leica laser scans, with articulated and static states.

381 Articulated Objects

Drawers, cabinets, dishwashers, fridges — across 38 real-world kitchens, bathrooms, and bedrooms.

Temporally Aligned

Nanosecond-resolution timestamps across all recording modules within each session.
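As an illustration of what the nanosecond-resolution timestamps make possible, the sketch below matches each video frame to its closest force/torque sample by timestamp. This is a minimal example under assumed inputs (sorted timestamp arrays), not the dataset's actual API; all names here are hypothetical.

```python
import numpy as np

def nearest_timestamp_align(frame_ts_ns, force_ts_ns):
    """Match each video frame to the nearest force/torque sample.

    frame_ts_ns, force_ts_ns: sorted 1-D arrays of nanosecond timestamps
    (hypothetical names; the released file layout may differ).
    Returns, for every frame, an index into force_ts_ns.
    """
    # Position of each frame timestamp within the force timeline.
    idx = np.searchsorted(force_ts_ns, frame_ts_ns)
    idx = np.clip(idx, 1, len(force_ts_ns) - 1)
    # Pick whichever neighbor (left or right) is closer in time.
    left = force_ts_ns[idx - 1]
    right = force_ts_ns[idx]
    idx -= frame_ts_ns - left < right - frame_ts_ns
    return idx
```

The same nearest-neighbor lookup applies to any pair of modalities (e.g. tactile frames to egocentric video), since all recording modules share one session clock.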

Paper Figures

Key figures from the paper. See the full paper for all figures and details.

Recording pipeline
Recording Pipeline. The four manipulation schemes and recording modules used for capture.
Dataset statistics
Dataset Statistics. Distribution of environments and articulated interaction categories in the Hoi! dataset.
Force profiles
Force Profiles. Force/torque measurements during articulated object manipulation.

Explore Samples

RGB image sequences from each recording modality and a 3D Leica point cloud. Select a scene to explore.

Views: Human Hand · Hoi! Gripper · Exocentric (iPhone) · UMI Gripper

Leica Point Cloud — Bedroom


Dataset Documentation


Download

Full dataset and code coming soon. Example data available now.

Team

Citation

@article{engelbracht2025hoi,
  title={Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation},
  author={Engelbracht, Tim and Zurbrügg, René and Wohlrapp, Matteo and Büchner, Martin and Valada, Abhinav and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
  journal={arXiv preprint arXiv:2512.04884},
  year={2025}
}