CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
Added: 2026-05-27

424 views · 35:04 · axyz2929 · Original Release: 2026-05-23

This video presents CARI4D, a method for reconstructing 4D human-object interactions with accurate contacts from a single monocular RGB video without assuming any template shape. The approach leverages foundation models for shape reconstruction, pose estimation, and scene understanding, then aligns their predictions and learns a category-agnostic model that reasons about interactions. The method addresses challenges including unknown shape, depth and scale ambiguity, and dynamic motion with occlusion. Evaluation on the BEHAVE and unseen InterCap datasets shows 38% and 36% improvement over previous methods, respectively, with better shape reconstruction, pose tracking, and contact accuracy than template-based methods like VisTracker and template-free approaches like InterTrack.

[00:00:00]Welcome. We present CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction.

[00:00:07]The goal of this project is to reconstruct 4D human-object interaction with accurate contacts from a single monocular RGB video as input. We do not assume any template shapes but reconstruct them from the input video.

[00:00:24]This is a very challenging problem due to unknown shape, depth and scale ambiguity, and dynamic motion with occlusion.

[00:00:33]The previous method VisTracker simplifies the problem by assuming a ground-truth object template. This produces consistent translation and a good mesh, but it requires a template and is limited to specific object instances. The recent method InterTrack is template-free, but it can only reconstruct point clouds and produces inconsistent translation. It also works only for limited object categories. What we want is a method that is template-free, has consistent inter-frame translation, and generalizes beyond fixed categories.

[00:01:09]Our key idea is to first leverage foundation models for shape reconstruction, pose estimation, and scene understanding.

[00:01:19]This is, however, non-trivial, as the predictions lie in different spaces and can suffer from noisy input. Our solution is to align the predictions and learn a model that reasons about interaction but is category agnostic.
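To illustrate what aligning predictions that lie in different spaces can involve, here is a minimal sketch, not the method from the paper. It assumes two sets of corresponding 3D points (for example, the same object surface points as predicted by two different models) that already share the same orientation and differ only by a global scale and a translation. Under that assumption the least-squares fit has a closed form. The function names `fit_scale_translation` and `apply_alignment` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Mean of a set of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def fit_scale_translation(
    src: Sequence[Point], dst: Sequence[Point]
) -> Tuple[float, Tuple[float, float, float]]:
    """Least-squares scale s and translation t such that dst ~ s * src + t.

    Assumption (not from the source): src and dst are corresponding points
    that share the same orientation, so no rotation is estimated.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two corresponding points")
    mu_src, mu_dst = centroid(src), centroid(dst)
    num = 0.0  # sum of dot products of the centered points
    den = 0.0  # sum of squared norms of the centered source points
    for p, q in zip(src, dst):
        for k in range(3):
            a = p[k] - mu_src[k]
            b = q[k] - mu_dst[k]
            num += a * b
            den += a * a
    if den == 0.0:
        raise ValueError("source points are all identical")
    scale = num / den
    translation = tuple(mu_dst[k] - scale * mu_src[k] for k in range(3))
    return scale, translation


def apply_alignment(
    points: Sequence[Point], scale: float, translation: Point
) -> List[Tuple[float, float, float]]:
    """Map points from the source space into the target space."""
    return [
        tuple(scale * p[k] + translation[k] for k in range(3)) for p in points
    ]
```

If the two predictions also differ by a rotation, a full similarity fit (such as the Umeyama method) would be needed instead. This sketch also ignores the noisy-input issue mentioned in the talk, which typically calls for a robust or learned refinement on top.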

[00:01:32]This is an overview of our method. We start with initialization using foundation models and then learn a model to reason about contacts and refine interaction poses, which are then used to further improve the realism of the contacts via optimization. Please see our paper for more details.

[00:01:52]We evaluate our method on the BEHAVE and unseen InterCap datasets. Our method outperforms previous methods by 38% on BEHAVE and 36% on the unseen InterCap dataset. We now show video comparisons.
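The talk does not say which metric the 38% and 36% figures refer to. As a hedged illustration only, assuming they are relative reductions of an error metric where lower is better, such a percentage would be computed as follows. The function name `relative_improvement` and the numbers in the usage note are hypothetical, not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Percentage reduction of an error metric relative to a baseline.

    Assumption (not from the source): the metric is an error, so a lower
    value is better and a positive result means the new method improves.
    """
    if baseline_error <= 0.0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For example, with made-up values, an error that drops from 0.50 to 0.40 corresponds to a 20% relative improvement.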

[00:02:08]InterTrack can only predict point clouds as output, and the predicted shapes and poses are noisy. In contrast, our method reconstructs good shape at metric scale and tracks the pose and contacts accurately.

[00:02:29]We now compare with VisTracker on BEHAVE. We provide our reconstructed object mesh as the template for VisTracker, but it still predicts noisy poses. In contrast, our method reconstructs good shape and tracks the interactions accurately across the full video.

[00:02:47]We now compare on the unseen InterCap dataset.

[00:02:51]Note that none of the models were trained on this dataset. PICO retrieves contacts from a predefined database, which can be noisy, leading to inaccurate optimized poses. Our method reconstructs the shape and contacts on the fly, coherently.

[00:03:06]We now visualize the top-down view. It can be seen that the image-based method PICO always predicts the human and object at the same location, while our method produces consistent translation across all video frames.

[00:03:20]We also compare with VisTracker on InterCap. We input our reconstructed mesh to VisTracker, yet it struggles to generalize to unseen instances, while our method generalizes much more stably.

[00:03:36]We now show results on in-the-wild internet videos. PICO queries objects from a predefined contact database, which can be very limited, and the queried objects as well as the contacts are inaccurate. In contrast, our method produces correct object shape reconstruction and more stable results.

[00:03:55]We compare with InterTrack in this video. It outputs point clouds that look plausible in the front view but are inconsistent in 3D. Our method reconstructs accurate shapes as meshes and tracks the poses and contacts accurately.

[00:04:13]We now compare with VisTracker. It fails to generalize to the unseen in-the-wild object, while our method generalizes well in both shape reconstruction and interaction pose tracking.

[00:04:31]We show another comparison with VisTracker. It predicts a completely flipped object pose. Notice how accurately our method reconstructs the metric-scale object, human poses, object poses, and contacts from just a monocular RGB video.

[00:04:54]Thank you for watching. Our code and pre-trained models will be publicly released.
