Related papers: 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

URL: http://arxiv.org/abs/2512.17012v2
Date: Mon, 22 Dec 2025 03:08:53 GMT
Title: 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Authors: Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Abstract summary: 4D-RGPT is a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception. P4D is a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception. R4D-Bench is a benchmark for depth-aware dynamic scenes with region-level prompting.
Score: 78.63581010756023
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Related papers

4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere [77.83037497484366]
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics.
arXiv Detail & Related papers (2026-02-10T18:57:04Z)
Any4D: Unified Feed-Forward Metric 4D Reconstruction [39.62006179006032]
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames. We achieve superior performance across diverse setups, both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster).
arXiv Detail & Related papers (2025-12-11T18:57:39Z)
C4D: 4D Made from 3D through Dual Correspondences [77.04731692213663]
We introduce C4D, a framework that leverages temporal correspondences to extend existing 3D reconstruction formulation to 4D. C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information.
arXiv Detail & Related papers (2025-10-16T17:59:06Z)
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes [65.76371201992654]
We propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. We also introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks.
arXiv Detail & Related papers (2025-03-17T17:58:18Z)
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
We present 4DGen, a novel framework for grounded 4D content creation. Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations. Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals.
arXiv Detail & Related papers (2023-12-28T18:53:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.