Mesh4D is a feed-forward, diffusion-based model that reconstructs textured 3D meshes with dense correspondences from monocular dynamic video. A 4D variational autoencoder with spatiotemporal attention layers encodes deformation fields in a compact latent space, and a deformation diffusion model conditioned on the video (DINO features) and on the canonical shape (3D shape VAE features) predicts that latent. The resulting 4D reconstructions are temporally consistent and outperform frame-wise methods in pose estimation and shape consistency.
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
We present Mesh4D, a feed-forward model for monocular 4D mesh reconstruction and tracking. Given a monocular, object-centric dynamic video as input, we want to reconstruct a sequence of textured 3D meshes with dense correspondences.
On the right, we show the 4D reconstruction output produced by our method seen from two different viewpoints.
To train a diffusion-based reconstruction model, we first need to build a 4D variational autoencoder that encodes the deformation field in a compact latent space. Given a sequence of 3D meshes as input, we first uniformly sample a sequence of corresponding points. Then we apply a position embedding and project it into a higher dimension.
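As a rough illustration of the position embedding step, the sketch below lifts a 3D point into a higher-dimensional feature vector with sinusoidal (Fourier) features. The talk does not say which embedding Mesh4D uses, so the function name `fourier_embed`, the power-of-two frequencies, and the `num_bands` parameter are all assumptions made for this example. The learned projection that follows the embedding is omitted.

```python
import math

def fourier_embed(point, num_bands=4):
    """Lift a 3D point to 3 + 6 * num_bands features.

    Assumed scheme (not confirmed by the talk): keep the raw coordinates,
    then append sin and cos of each coordinate at frequencies 1, 2, 4, ...
    """
    feats = list(point)
    for k in range(num_bands):
        freq = 2.0 ** k
        for coord in point:
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats

embedded = fourier_embed((0.0, 0.5, -1.0))
print(len(embedded))  # 3 raw coordinates + 4 bands * 3 coords * 2 functions
```

In a real model the embedded points would then pass through a learned linear layer to reach the network width.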
We inject the skeleton information by using masked self-attention and cross-attention.
Farthest point sampling (FPS) along the spatial dimension then compresses the latent, followed by eight layers of spatiotemporal attention.
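Farthest point sampling can be sketched in a few lines. This is the generic greedy algorithm, not the authors' code: `farthest_point_sampling` and `start_index` are names chosen for this example. One plausible reading of "at spatial dimension" is that the indices are chosen once and reused for every frame so the kept tokens stay in correspondence over time, but the talk does not confirm this.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices of points that are mutually far apart."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    selected = [start_index]
    # min_dist[i]: squared distance from point i to its nearest selected point
    min_dist = [sq_dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # The next sample is the point farthest from everything chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
print(farthest_point_sampling(points, 3))  # [0, 2, 3]
```

Each new pick costs one pass over the point set, so selecting k of n points takes O(n * k) distance evaluations.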
The deformation field is decoded by layers of spatiotemporal attention followed by a cross-attention layer in which the canonical vertices serve as query points.
Each of our spatiotemporal attention layers sequentially performs temporal attention, global attention and spatial attention.
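The ordering of the three attention passes can be made concrete with a toy layer. This is only a sketch under strong simplifications: attention is single-head with identity query, key, and value projections, each pass is wrapped in a residual connection, and normalization and feed-forward sublayers are left out. None of these details come from the talk. Only the temporal, global, spatial order does. The names `self_attention` and `spatiotemporal_layer` are hypothetical.

```python
import math

def self_attention(tokens):
    """Single-head softmax self-attention with identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) / total
                    for c in range(dim)])
    return out

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def spatiotemporal_layer(latent):
    """latent[t][n] is the feature vector of token n at frame t."""
    T, N = len(latent), len(latent[0])
    x = [[list(v) for v in frame] for frame in latent]  # copy the input
    # 1) Temporal attention: each token attends to itself across frames.
    for n in range(N):
        track = [x[t][n] for t in range(T)]
        update = self_attention(track)
        for t in range(T):
            x[t][n] = add(track[t], update[t])
    # 2) Global attention: all T * N tokens attend to each other.
    flat = [x[t][n] for t in range(T) for n in range(N)]
    update = self_attention(flat)
    for t in range(T):
        for n in range(N):
            x[t][n] = add(flat[t * N + n], update[t * N + n])
    # 3) Spatial attention: tokens within one frame attend to each other.
    for t in range(T):
        update = self_attention(x[t])
        x[t] = [add(v, u) for v, u in zip(x[t], update)]
    return x

latent = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]] for _ in range(2)]  # T=2, N=3, C=2
out = spatiotemporal_layer(latent)
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 2
```

The temporal and spatial passes only look at T or N tokens at a time, which is much cheaper than the global pass over all T * N tokens.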
The encoded deformation latent is used as the target for the deformation diffusion model. We start from a noisy latent sampled from a Gaussian distribution.
We add the temporal embedding and the spatial embedding from the canonical point cloud. A skip connection is added in the DiT block to preserve the information from previous blocks.
Video conditions are injected in the cross-attention layer using DINO features.
Canonical shape conditions are also applied through cross-attention, using vector-set features extracted from the HY3D shape VAE.
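The two conditioning paths can be pictured as successive cross-attention steps in which the latent tokens act as queries and the condition features act as keys and values. The sketch below is a toy version with identity projections and plain residual connections. The function names, the order in which the two conditions are applied, and the assumption that all features share one width are illustrative choices, not details given in the talk.

```python
import math

def cross_attention(queries, context):
    """Each query token takes a softmax-weighted average of the context tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, context)) / total
                    for c in range(dim)])
    return out

def apply_conditions(latent_tokens, video_feats, shape_feats):
    """Inject video features, then canonical shape features, with residuals."""
    x = [list(t) for t in latent_tokens]
    for cond in (video_feats, shape_feats):  # assumed order: video first
        update = cross_attention(x, cond)
        x = [[a + b for a, b in zip(t, u)] for t, u in zip(x, update)]
    return x

latent_tokens = [[0.0, 0.0], [1.0, 1.0]]
video_feats = [[1.0, 0.0]]   # stand-in for per-frame image features
shape_feats = [[0.0, 2.0]]   # stand-in for vector-set shape features
print(apply_conditions(latent_tokens, video_feats, shape_feats))
# [[1.0, 2.0], [2.0, 3.0]]
```

With a single context token the attention weights collapse to 1, which is why the printed result is just the latent plus both condition vectors.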
We first show results for novel view synthesis and tracking. We compare our approach with three state-of-the-art methods: HY3D, GVDF and L4GM. Frame-wise reconstruction methods like HY3D produce inconsistent shape and texture. Our method avoids predicting an extremely incorrect canonical mesh by leveraging a large-scale reconstruction method. All the state-of-the-art methods suffer from inaccurate pose estimation, for example, for the limbs of the mantis.
This is due to a lack of temporal attention or to the absence of geometric supervision.
Similar errors such as inconsistent shape and texture and inaccurate pose estimation also appear in this other example.
3DGS-based methods occasionally exhibit ghost artifacts due to a lack of topological and geometric constraints during training. Thanks to the geometric constraints, skeleton information, and spatiotemporal attention, Mesh4D is able to reconstruct accurate pose and geometry and produces temporally consistent novel-view video.
To better assess the quality of the estimated geometry, we also visualize the normal maps. Even if the canonical mesh is not exactly accurate, we can still reconstruct physically plausible and reasonable motion.
In contrast, the HY3D model often produces inconsistent shapes even if we apply shared noise to different frames.
Here we show more reconstruction results. Our method generalizes well to different types of dynamic objects, including humans, humanoids, animals, and monsters.
Our model also generalizes very well to in-the-wild monocular videos. Here we show some reconstruction results from the in-the-wild Consistent4D dataset.
Thank you for watching.