A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Coupling what is seen, what is done, and what is felt during real human interaction with articulated objects.
Abstract
We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments — (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper — where the instrumented Hoi! gripper additionally provides synchronized end-effector forces and tactile sensing. Our dataset supports holistic interaction understanding from video, enabling researchers not only to evaluate how well methods transfer between human and robotic viewpoints, but also to investigate underexplored modalities such as force sensing and prediction.
The Dataset
4 Manipulation Schemes
Hoi! gripper, human hand, human hand with a wrist-mounted camera, and UMI gripper — enabling cross-embodiment research.
Multi-View Capture
Egocentric (Aria glasses), manipulation-centric (wrist/gripper), and exocentric (iPhone RGB-D) viewpoints.
Force & Tactile Sensing
Synchronized force/torque and DIGIT tactile sensing through the custom Hoi! gripper.
Spatial Alignment
All recordings are registered to a common frame via Leica laser scans that capture both articulated and static states (see the registration sketch below).
381 Articulated Objects
Drawers, cabinets, dishwashers, fridges — across 38 real-world kitchens, bathrooms, and bedrooms.
Temporally Aligned
Nanosecond-resolution timestamps across all recording modules within each session (see the synchronization sketch below).
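
As a concrete illustration, the minimal sketch below uses these shared nanosecond timestamps to match each video frame to its nearest force/torque sample. The file layout, array names, and sensor rates are hypothetical placeholders under assumed conventions, not the dataset's actual API.

# Hypothetical synchronization sketch: align a ~30 fps camera stream with a
# 1 kHz force/torque stream using shared nanosecond timestamps.
import numpy as np

def nearest_sample_indices(frame_ts_ns: np.ndarray, force_ts_ns: np.ndarray) -> np.ndarray:
    """For each frame timestamp, return the index of the closest
    force/torque sample. Both arrays are in nanoseconds, sorted ascending."""
    idx = np.searchsorted(force_ts_ns, frame_ts_ns)
    idx = np.clip(idx, 1, len(force_ts_ns) - 1)
    # Pick whichever neighbor (left or right) is temporally closer.
    left, right = force_ts_ns[idx - 1], force_ts_ns[idx]
    idx -= (frame_ts_ns - left) < (right - frame_ts_ns)
    return idx

# Stand-in timestamp arrays; real sessions would load these from disk.
frame_ts = np.array([0, 33_366_667, 66_733_333])   # ~30 fps camera, ns
force_ts = np.arange(0, 100_000_000, 1_000_000)    # 1 kHz F/T stream, ns
matched = nearest_sample_indices(frame_ts, force_ts)
print(matched)  # force sample index aligned to each video frame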
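
The spatial registration mentioned above works similarly: a per-session rigid transform maps local sensor coordinates into the common frame defined by the Leica scans. The sketch below shows the standard homogeneous-transform operation; the transform values and variable names are illustrative assumptions, not released calibration data.

# Hypothetical registration sketch: move an (N, 3) point cloud from a local
# sensor frame into the common Leica frame via a 4x4 homogeneous transform.
import numpy as np

def to_common_frame(points_local: np.ndarray, T_common_local: np.ndarray) -> np.ndarray:
    """Apply the 4x4 transform T_common_local to an (N, 3) point cloud."""
    homog = np.hstack([points_local, np.ones((len(points_local), 1))])
    return (homog @ T_common_local.T)[:, :3]

# Stand-in transform and cloud; real values would come from the session's
# registration against the Leica scan.
T = np.eye(4)
T[:3, 3] = [1.0, 0.5, 0.0]              # example translation in meters
cloud = np.random.rand(1000, 3)         # placeholder sensor point cloud
cloud_common = to_common_frame(cloud, T)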
Paper Figures
Key figures from the paper. See the full paper for all figures and details.
Explore Samples
RGB image sequences from each recording modality and a 3D Leica point cloud.
Leica Point Cloud — Bedroom
Dataset Documentation
Download
Full dataset and code coming soon. Example data available now.
Team
Citation
@article{engelbracht2025hoi,
  title={Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation},
  author={Engelbracht, Tim and Zurbrügg, René and Wohlrapp, Matteo and Büchner, Martin and Valada, Abhinav and Pollefeys, Marc and Blum, Hermann and Bauer, Zuria},
  journal={arXiv preprint arXiv:2512.04884},
  year={2025}
}