Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Added: 2026-05-10

454 views • 14:13 • zerenjiang6402 • Originally released: 2026-05-06

Mesh4D is a feed-forward, diffusion-based model that reconstructs textured 3D meshes with dense correspondences from monocular dynamic video. It encodes deformation fields in a compact latent space using a 4D variational autoencoder with spatiotemporal attention layers, then generates those latents with a deformation diffusion model conditioned on the video (DINO features) and on the canonical shape (3D shape VAE features). The resulting 4D reconstructions are temporally consistent and outperform frame-wise methods in pose estimation and shape consistency.

[00:00:00]We present Mesh4D, a feed-forward model for monocular 4D mesh reconstruction and tracking. Given a monocular, object-centric dynamic video as input, we want to reconstruct a sequence of textured 3D meshes with dense correspondences.

[00:00:16]On the right, we show the 4D reconstruction output produced by our method seen from two different viewpoints.

[00:00:25]To train a diffusion-based reconstruction model, we first need to build a 4D variational autoencoder that encodes the deformation field in a compact latent space. Given a sequence of 3D meshes as input, we first uniformly sample a sequence of corresponding points. Then we apply a positional embedding and project it into a higher dimension.
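The point sampling and embedding step described here can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the mesh sequence shares one topology, that "uniform" means area-weighted sampling on the first frame, and that the positional embedding is a Fourier-feature encoding. The function names `sample_corresponding_points` and `fourier_embed` are hypothetical.

```python
import numpy as np

def sample_corresponding_points(vertices, faces, num_points, seed=0):
    """vertices: (T, V, 3) mesh sequence with shared topology; faces: (F, 3) ints.
    Returns (T, num_points, 3) points that follow the same surface location in every frame."""
    rng = np.random.default_rng(seed)
    # Area-weighted face choice on the first frame (assumption) gives uniform density there.
    tri = vertices[0][faces]                                   # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    face_ids = rng.choice(len(faces), size=num_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle (square-root trick).
    r1, r2 = rng.random(num_points), rng.random(num_points)
    s = np.sqrt(r1)
    bary = np.stack([1 - s, s * (1 - r2), s * r2], axis=1)     # (N, 3)
    # Reusing the same (face, barycentric) pairs in every frame keeps correspondence.
    tris = vertices[:, faces[face_ids]]                        # (T, N, 3, 3)
    return np.einsum("nk,tnkd->tnd", bary, tris)

def fourier_embed(points, num_freqs=4):
    """Maps (..., 3) coordinates to (..., 6 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)                        # (L,)
    angles = points[..., None] * freqs                         # (..., 3, L)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*points.shape[:-1], -1)
```

In a full model, a learned linear layer would typically project these features to the latent width; that layer is omitted here.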

[00:00:48]We inject the skeleton information using masked self-attention and cross-attention.

[00:00:53]Then farthest point sampling (FPS) is performed along the spatial dimension to compress the latent, followed by eight layers of spatiotemporal attention.
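For readers unfamiliar with farthest point sampling, here is a minimal NumPy sketch of the standard greedy algorithm. It is a generic illustration rather than the code used in Mesh4D, and the function name and `start` parameter are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """points: (N, 3). Returns the indices of num_samples points chosen greedily
    so that each new point is as far as possible from those already selected."""
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)      # squared distance to the nearest selected point
    selected[0] = start
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("nd,nd->n", diff, diff))
        selected[i] = int(np.argmax(min_dist))
    return selected
```

One plausible way to compress only the spatial dimension, stated here as an assumption, is to compute the indices once on a single frame and reuse them for every frame, for example `tokens[:, farthest_point_sampling(points[0], 512)]`, so the retained tokens stay in correspondence over time.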

[00:01:03]The deformation field is decoded by layers of spatiotemporal attention followed by a cross-attention layer in which canonical vertices serve as query points.
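The final decoding step, where canonical vertices query the latent, might look roughly like the sketch below. It uses single-head attention without learned query, key, or value projections, and it assumes the decoder predicts per-vertex offsets that are added to the canonical positions; the talk does not specify either detail. `embed` and `w_out` are hypothetical stand-ins for learned components.

```python
import numpy as np

def cross_attention(queries, context):
    """queries: (Q, C), context: (K, C). Single head, no learned projections."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def decode_deformation(canonical_vertices, latents, embed, w_out):
    """canonical_vertices: (V, 3); latents: (T, M, C) decoded latent tokens per frame;
    embed: maps (V, 3) -> (V, C); w_out: (C, 3) output projection (hypothetical).
    Returns (T, V, 3) deformed vertex positions."""
    q = embed(canonical_vertices)                      # one query per canonical vertex
    offsets = np.stack([cross_attention(q, frame) @ w_out for frame in latents])
    # Assumption: the network outputs offsets relative to the canonical mesh.
    return canonical_vertices[None] + offsets
```

Because the queries are arbitrary 3D points, the same latent can be decoded at any set of canonical vertices, which is what makes the output a deformation field rather than a fixed-size point set.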

[00:01:13]Each of our spatiotemporal attention layers sequentially performs temporal attention, global attention and spatial attention.
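The three attention passes differ only in which tokens are grouped into one sequence. The sketch below illustrates that reshaping on a latent of shape `(T, N, C)`. It is a simplified stand-in: attention is single-head with no learned projections, normalization, or feed-forward sublayers, and the residual connections are an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned projections. tokens: (B, S, C)."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def spatiotemporal_layer(x):
    """x: (T, N, C) latent with T frames and N points per frame."""
    T, N, C = x.shape
    # Temporal: each point attends to itself across frames (batch N, sequence T).
    x = x + self_attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)
    # Global: every token attends to all T * N tokens.
    x = x + self_attention(x.reshape(1, T * N, C)).reshape(T, N, C)
    # Spatial: points within one frame attend to each other (batch T, sequence N).
    x = x + self_attention(x)
    return x
```

The temporal and spatial passes cost on the order of `N * T^2` and `T * N^2` score entries respectively, while the global pass costs `(T * N)^2`, which is why factorized passes are commonly combined with a single global one.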

[00:01:23]The encoded deformation latent is used as the target for the deformation diffusion model. We start from a noisy latent sampled from a Gaussian distribution.

[00:01:34]We add the temporal embedding and the spatial embedding from the canonical point cloud. A skip connection is added in the DiT block to preserve the information from previous blocks.

[00:01:45]Video conditions are injected in the cross-attention layer using DINO features.

[00:01:50]Canonical shape conditions are also applied through cross-attention, using vector-set features extracted from the HY3D shape VAE.

[00:02:00]We first show results for novel view synthesis and tracking. We compare our approach with three state-of-the-art methods: HY3D, GVDF, and L4GM. Frame-wise reconstruction methods like HY3D produce inconsistent shape and texture. Our method avoids predicting an extremely incorrect canonical mesh by leveraging a large-scale reconstruction method. All the state-of-the-art methods suffer from inaccurate pose estimation, for example, for the limbs of the mantis.

[00:02:32]This is due to a lack of temporal attention or to the absence of geometric supervision.

[00:02:38]Similar errors such as inconsistent shape and texture and inaccurate pose estimation also appear in this other example.

[00:02:46]3DGS-based methods occasionally exhibit ghost artifacts due to a lack of topological and geometric constraints during training. Thanks to its geometric constraints, skeleton information, and spatiotemporal attention, Mesh4D reconstructs accurate pose and geometry and produces temporally consistent novel-view video.

[00:03:12]To better assess the quality of the estimated geometry, we also visualize the normal maps. Even if the canonical mesh is not exactly accurate, we can still reconstruct physically plausible and reasonable motion.

[00:03:23]In contrast, the HY3D model often produces inconsistent shapes even if we apply shared noise to different frames.

[00:03:33]Here we show more reconstruction results. Our method generalizes well to different types of dynamic objects, including humans, humanoids, animals, and monsters.

[00:03:55]Our model also generalizes very well to in-the-wild monocular videos. Here we show some reconstruction results from the in-the-wild Consistent4D dataset.

[00:04:11]Thank you for watching.
