FuguReport

Syn4D: A Multiview Synthetic 4D Dataset

Authors: Zeren Jiang, Yushi Lan, Yihang Luo, Yufan Deng, Zihang Lai, Edgar Sucar, Christian Rupprecht, Iro Laina, Diane Larlus, Chuanxia Zheng, Andrea Vedaldi
Affiliations: University of Oxford / NAVER / Nanyang Technological University
Categories: Task / 3D Reconstruction / Dynamic scene reconstruction, Task / Tracking / 3D tracking of dynamic scenes, Application / Multiview Dataset / Synthetic multiview dynamic scene dataset
License: CC BY 4.0

Abstract Overview

Syn4D is a large-scale synthetic multiview 4D dataset designed for dynamic scene understanding, reconstruction, and tracking. It contains 4.7K multiview video clips and 1.4M frames rendered in Unreal Engine, with annotations including camera motion, depth, point maps, dense 2D/3D tracking, instance segmentation, captions, and parametric human pose (SMPL-X). A defining feature is that any pixel can be mapped to its 3D position across time and across cameras, enabling dense multiview tracking for arbitrary points. The paper introduces an efficient barycentric-map-based representation for storing dense tracks and evaluates the dataset on geometry-aware novel-view synthesis, 4D reconstruction, 3D tracking, video depth estimation, and human pose estimation.

Novelty

The paper's main novelty is a publicly available multiview synthetic 4D dataset with dense and complete 3D tracking annotations for general dynamic scenes, which the authors describe as the first of its kind in this setting. A second distinctive contribution is an efficient dynamic point-map representation based on pixel-aligned barycentric coordinates and animated mesh sequences, reducing storage complexity from O(HWT³C²) to O(HWTC + VT) and making dense track storage and querying practical.
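
The kind of query this representation supports can be illustrated with a minimal sketch. It assumes, hypothetically, that each pixel stores the index of the mesh triangle it sees plus three barycentric weights, and that the animated mesh is stored as per-frame vertex positions. The function name and array layouts (`track_pixel`, `face_id_map`, `bary_map`, `faces`, `vertices_over_time`) are illustrative and are not the dataset's actual file format or API.

```python
import numpy as np


def track_pixel(face_id_map, bary_map, faces, vertices_over_time, y, x):
    """Return the (T, 3) trajectory of the surface point seen at pixel (y, x).

    Assumed (hypothetical) layouts:
      face_id_map:        (H, W) int, triangle index per pixel, -1 for background
      bary_map:           (H, W, 3) float, barycentric weights per pixel
      faces:              (F, 3) int, vertex indices of each triangle
      vertices_over_time: (T, V, 3) float, animated mesh vertex positions
    """
    face = face_id_map[y, x]
    if face < 0:
        return None  # the pixel does not lie on any mesh surface
    weights = bary_map[y, x]                      # (3,)
    triangle = faces[face]                        # (3,) vertex indices
    corners = vertices_over_time[:, triangle, :]  # (T, 3, 3)
    # Weighted sum of the three corner positions, evaluated at every frame.
    return np.einsum("k,tkc->tc", weights, corners)


# Toy scene: one triangle that moves by +1 along z between two frames.
faces = np.array([[0, 1, 2]])
verts_t0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
vertices_over_time = np.stack([verts_t0, verts_t0 + np.array([0.0, 0.0, 1.0])])

face_id_map = np.full((2, 2), -1)
face_id_map[0, 0] = 0
bary_map = np.zeros((2, 2, 3))
bary_map[0, 0] = [0.2, 0.3, 0.5]

trajectory = track_pixel(face_id_map, bary_map, faces, vertices_over_time, 0, 0)
background = track_pixel(face_id_map, bary_map, faces, vertices_over_time, 1, 1)
```

Under these assumptions, the per-pixel maps are stored once per frame and camera, while the trajectory of any pixel is recovered on demand from the shared vertex sequence instead of being stored explicitly.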

Results

In geometry-aware novel-view synthesis, a model trained on Syn4D outperformed its Kubric-trained counterpart on the authors' benchmark, both on visual quality (CLIP-V: 0.740 vs. 0.643, higher is better; FVD: 452 vs. 631, lower is better) and on geometry metrics. Co-training 4RC with Syn4D improved 3D tracking (e.g., dense APD from 79.07 to 88.79), multi-view reconstruction, and video depth estimation across standard benchmarks. Fine-tuning MA-HMR with Syn4D yielded consistent but modest gains on Hi4D, CHI3D, and 3DPW compared to continued training without Syn4D.

Key Points

  1. Syn4D provides 4.7K multiview dynamic video clips (1.4M frames) with dense geometry supervision: camera parameters, depth, point maps, dense 2D/3D tracking, instance segmentation, and SMPL-X-based human pose annotations. The scenes are built from 1,674 animated Objaverse assets and 585 Bedlam2 humans placed in 30 Unreal Engine environments.
  2. A core technical contribution is an efficient representation of dense dynamic point tracks using per-pixel barycentric maps plus animated mesh vertex sequences, reducing storage from an infeasible O(HWT³C²) to a practical O(HWTC + VT).
  3. Training with Syn4D improves existing state-of-the-art models without any architecture modifications: co-training 4RC yields gains on 3D tracking, video depth estimation, camera pose estimation, and multi-view reconstruction, while fine-tuning MA-HMR yields modest improvements on human pose estimation benchmarks.
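
To give a rough sense of scale for the storage reduction in point 2, the two asymptotic expressions can be evaluated for one example configuration. This is only a back-of-the-envelope sketch: the resolution, frame count, camera count, and vertex count below are hypothetical and not taken from the dataset, and constant factors (channels, bytes per value, compression) are ignored.

```python
def naive_track_entries(H, W, T, C):
    """Entry count following the O(H*W*T^3*C^2) expression quoted above."""
    return H * W * T**3 * C**2


def barycentric_entries(H, W, T, C, V):
    """Entry count following the O(H*W*T*C + V*T) expression quoted above."""
    return H * W * T * C + V * T


# Hypothetical clip: 640x480 frames, 100 frames, 4 cameras, 100k mesh vertices.
H, W, T, C, V = 480, 640, 100, 4, 100_000

naive = naive_track_entries(H, W, T, C)
compact = barycentric_entries(H, W, T, C, V)
ratio = naive / compact

print(f"naive:   {naive:.3e} entries")
print(f"compact: {compact:.3e} entries")
print(f"ratio:   {ratio:,.0f}x")
```

For this made-up configuration the compact form is smaller by a factor of several tens of thousands, and the gap widens as the number of frames and cameras grows.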

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.