This video presents CARI4D, a method for reconstructing 4D human-object interactions with accurate contacts from a single monocular RGB video without assuming any template shape. The approach leverages foundation models for shape reconstruction, pose estimation, and scene understanding, then aligns their predictions and learns a category-agnostic model that reasons about interactions. The method addresses challenges including unknown shape, depth and scale ambiguity, and dynamic motion with occlusion. Evaluation on the BEHAVE dataset and the unseen InterCap dataset shows 38% and 36% improvement over previous methods, respectively. Video comparisons show better shape reconstruction, pose tracking, and contact accuracy than the template-based VisTracker and the template-free InterTrack.
Deep Dive
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
Welcome. We present CARI4D: category-agnostic 4D reconstruction of human-object interaction.
The goal of this project is to reconstruct 4D human-object interaction with accurate contacts from a single monocular RGB video as input. We do not assume any template shape but reconstruct the shapes from the input video.
This is a very challenging problem due to unknown shape, depth and scale ambiguity, and dynamic motion with occlusion.
The previous method VisTracker simplifies the problem by assuming a ground-truth object template. This produces consistent translation and a good mesh, but it requires a template and is limited to specific object instances. The recent method InterTrack is template-free, but it can only reconstruct point clouds and produces inconsistent translation. It also works only for limited object categories. What we want is a method that is template-free, has consistent inter-frame translation, and generalizes beyond fixed categories.
Our key idea is to first leverage foundation models for shape reconstruction, pose estimation, and scene understanding.
This is, however, non-trivial, as the predictions lie in different spaces and can suffer from noisy input. Our solution is to align the predictions and learn a model that reasons about interaction but is category-agnostic.
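As a rough illustration of what it means to bring predictions from different spaces into one frame, the sketch below fits a single scale and translation that map points from one space onto corresponding points in another. This is not the alignment used in CARI4D, which the video does not detail. It assumes that point correspondences are known and that both spaces already share the same orientation, and the function names `fit_scale_translation` and `apply_scale_translation` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def fit_scale_translation(
    src: Sequence[Point], dst: Sequence[Point]
) -> Tuple[float, Point]:
    """Least-squares scale s and translation t minimizing sum ||s*src_i + t - dst_i||^2.

    Assumption (not from the video): src[i] corresponds to dst[i], and the two
    spaces differ only by a uniform scale and a translation (no rotation).
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two corresponding points")
    n = len(src)
    src_mean = [sum(p[k] for p in src) / n for k in range(3)]
    dst_mean = [sum(q[k] for q in dst) / n for k in range(3)]

    num = 0.0  # sum of dot products of the centered point pairs
    den = 0.0  # sum of squared norms of the centered source points
    for p, q in zip(src, dst):
        for k in range(3):
            dp = p[k] - src_mean[k]
            dq = q[k] - dst_mean[k]
            num += dp * dq
            den += dp * dp
    if den == 0.0:
        raise ValueError("source points are all identical")

    s = num / den
    t = (
        dst_mean[0] - s * src_mean[0],
        dst_mean[1] - s * src_mean[1],
        dst_mean[2] - s * src_mean[2],
    )
    return s, t


def apply_scale_translation(points: Sequence[Point], s: float, t: Point) -> List[Point]:
    """Map points into the target space with the fitted scale and translation."""
    return [(s * p[0] + t[0], s * p[1] + t[1], s * p[2] + t[2]) for p in points]


# Toy example: the target points are the source points scaled by 2 and shifted.
src_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst_pts = [(1.0, 2.0, 3.0), (3.0, 2.0, 3.0), (1.0, 4.0, 3.0), (1.0, 2.0, 5.0)]
scale, shift = fit_scale_translation(src_pts, dst_pts)
aligned = apply_scale_translation(src_pts, scale, shift)
```

In a real pipeline, the alignment would also need to handle rotation, unknown correspondences, and noisy predictions, which is part of why the problem is non-trivial.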
This is an overview of our method. We start with initialization using foundation models and then learn a model to reason about contacts and refine interaction poses, which are then used to further improve the realism of the contacts via optimization. Please see our paper for more details.
We evaluate our method on the BEHAVE dataset and the unseen InterCap dataset. Our method outperforms previous methods by 38% on BEHAVE and 36% on the unseen InterCap dataset. We now show video comparisons.
InterTrack can only predict point clouds as output, and the predicted shapes and poses are noisy. In contrast, our method reconstructs good shape at metric scale and tracks the pose and contacts accurately.
We now compare with VisTracker on BEHAVE. We pass our reconstructed object mesh to VisTracker as the template, but it still predicts noisy poses. In contrast, our method reconstructs good shape and tracks the interactions accurately across the full video.
We now compare on the unseen InterCap dataset.
Note that none of the models were trained on this dataset. PICO retrieves contacts from a predefined database, which can be noisy, leading to inaccurate optimized poses. Our method reconstructs the shape and contacts on the fly, coherently.
We now visualize the top-down view. It can be seen that the image-based method PICO always predicts the human and object at the same location, while our method produces consistent translation across all video frames.
We also compare with VisTracker on InterCap. We input our reconstructed mesh to VisTracker, yet it struggles to generalize to unseen instances, while our method is much more stable.
We now show results on in-the-wild internet videos. PICO queries the object from a predefined contact database, which can be very limited, and the queried object as well as the contacts are inaccurate. In contrast, our method produces correct object shape reconstruction and more stable results.
We compare with InterTrack in this video. It outputs point clouds that look plausible in the front view but are inconsistent in 3D. Our method reconstructs accurate shapes as meshes and tracks the poses and contacts accurately.
We now compare with VisTracker. It fails to generalize to the unseen in-the-wild object, while our method generalizes well in both shape reconstruction and interaction pose tracking.
We show another comparison with VisTracker. It predicts a completely flipped object pose. Notice how accurately our method reconstructs the metric-scale object, human poses, object poses, and contacts from just monocular RGB video.
Thank you for watching. Our code and pre-trained models will be publicly released.