A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Coupling what is seen, what is done, and what is felt during real human interaction with articulated objects.

Tim Engelbracht¹, René Zurbrügg¹, Matteo Wohlrapp², Martin Büchner³, Abhinav Valada³, Marc Pollefeys¹,⁴, Hermann Blum⁵, Zuria Bauer¹

¹ETH Zurich · ²TU Munich · ³U. Freiburg · ⁴Microsoft · ⁵U. Bonn

Paper (arXiv) · Code (soon) · Dataset · Example Data · CAD Files (soon)

3048 Sequences · 381 Articulated Objects · 38 Environments · 4 Embodiments

Abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments: (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper. The Hoi! gripper additionally provides synchronized end-effector force and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints and to investigate underexplored modalities such as force sensing and prediction.

The Dataset

4 Manipulation Schemes

Hoi! gripper, human hand, hand with wrist-camera, and UMI gripper — enabling cross-embodiment research.

Multi-View Capture

Egocentric (Aria glasses), manipulation-centric (wrist/gripper), and exocentric (iPhone RGB-D) viewpoints.

Force & Tactile Sensing

Synchronized force/torque and DIGIT tactile sensing through the custom Hoi! gripper.

Spatial Alignment

All recordings registered to a common frame via Leica laser scans, with articulated and static states.
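As a rough illustration of what registration to a common frame means for downstream use, the sketch below maps a 3D point from a sensor frame into a shared scene frame with a 4x4 homogeneous rigid transform. The function name `to_common_frame`, the matrix `T_scene_from_sensor`, and the nested-list layout are assumptions made for illustration; the dataset's actual file format and naming are not described here.

```python
def to_common_frame(point, T):
    """Apply a 4x4 homogeneous rigid transform (nested lists) to one 3D point.

    `T` is assumed to map sensor-frame coordinates into the common scene frame.
    """
    x, y, z = point
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))


# Hypothetical transform: 90 degree rotation about z, then a 1 m shift along x.
T_scene_from_sensor = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
]

print(to_common_frame((1.0, 0.0, 0.0), T_scene_from_sensor))  # (1.0, 1.0, 0.0)
```

Once every recording has such a transform into the same frame, observations from different views and embodiments can be compared in one coordinate system.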

381 Articulated Objects

Drawers, cabinets, dishwashers, fridges — across 38 real-world kitchens, bathrooms, and bedrooms.

Temporally Aligned

Nanosecond-resolution timestamps across all recording modules within each session.
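One common way to use shared timestamps is to pair each video frame with the closest sample from a faster sensor stream. The sketch below does this with a binary search over sorted integer nanosecond timestamps. The function `nearest_index` and the example timestamp values are hypothetical, and the sketch assumes timestamps are available as sorted integers; it is not the dataset's own loader.

```python
from bisect import bisect_left


def nearest_index(sorted_ts_ns, query_ns):
    """Return the index of the timestamp in `sorted_ts_ns` closest to `query_ns`."""
    if not sorted_ts_ns:
        raise ValueError("timestamp list is empty")
    i = bisect_left(sorted_ts_ns, query_ns)
    if i == 0:
        return 0
    if i == len(sorted_ts_ns):
        return len(sorted_ts_ns) - 1
    before, after = sorted_ts_ns[i - 1], sorted_ts_ns[i]
    # On a tie, prefer the earlier sample.
    return i if (after - query_ns) < (query_ns - before) else i - 1


# Hypothetical streams: force samples every 1 ms, three video frames.
force_ts = [0, 1_000_000, 2_000_000, 3_000_000]
frame_ts = [100_000, 1_600_000, 2_900_000]

matches = [nearest_index(force_ts, t) for t in frame_ts]
print(matches)  # [0, 2, 3]
```

Keeping nanosecond timestamps as integers avoids the precision loss that occurs when large epoch-style values are converted to 64-bit floats.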

The Hoi! Gripper

The Hoi! gripper is a custom end-effector designed to bridge human and robotic manipulation. Used as a handheld tool, it captures aligned force/torque and tactile sensing alongside egocentric video, making it the primary instrumented embodiment in the dataset.

F/T Sensing — 6-axis force/torque at the end-effector
Tactile Sensing — DIGIT sensor for contact texture and pressure
Stereo Camera — manipulation-centric stereo camera aligned with interaction
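To give a sense of how 6-axis force/torque readings might be summarized, the sketch below computes the force magnitude of each sample and the peak over an interaction. The component order `[fx, fy, fz, tx, ty, tz]`, the units (newtons and newton-metres), the function names, and the sample values are all assumptions for illustration; the actual layout of the gripper's recordings may differ.

```python
import math


def force_magnitude(wrench):
    """Euclidean norm of the force part of a 6-axis wrench.

    Assumes the order [fx, fy, fz, tx, ty, tz].
    """
    fx, fy, fz = wrench[:3]
    return math.sqrt(fx * fx + fy * fy + fz * fz)


def peak_force(wrenches):
    """Largest force magnitude over a sequence of wrench samples."""
    return max(force_magnitude(w) for w in wrenches)


# Hypothetical samples from pulling a drawer open.
samples = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.1, 0.0, 0.0],
    [1.0, 2.0, 2.0, 0.0, 0.1, 0.0],
]

print(peak_force(samples))  # 5.0
```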


Paper Figures

Key figures from the paper. See the full paper for all figures and details.

Recording Pipeline. The four manipulation schemes and recording modules used for capture.

Dataset Statistics. Distribution of environments and articulated interaction categories in the Hoi! dataset.

Force Profiles. Force/torque measurements during articulated object manipulation.

Explore Samples

RGB image sequences from each recording modality and a 3D Leica point cloud.

Human Hand

Hoi! Gripper

Exocentric (iPhone)

UMI Gripper

Leica Point Cloud — Bedroom



Team

Contributors

Kavya Shankar

U. Bonn

Citation

@InProceedings{Engelbracht_2026_CVPR,
    author    = {Engelbracht, Tim and Zurbrügg, René and Wohlrapp, Matteo and Büchner, Martin and Valada, Abhinav and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
    title     = {Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {8880-8890}
}

License

The Hoi! dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, including commercially, as long as you give appropriate credit.