Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
- URL: http://arxiv.org/abs/2509.10687v2
- Date: Tue, 04 Nov 2025 23:02:13 GMT
- Title: Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
- Authors: Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani
- Abstract summary: We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts.
- Score: 48.87022820000206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
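The spatial color encoding described above can be illustrated with a minimal sketch. The paper does not specify the exact palette, so the evenly spaced HSV hues and the nearest-color decoding below are assumptions; the point is only that part IDs round-trip through an RGB-like image via straightforward post-processing:

```python
import colorsys
import numpy as np

def part_palette(num_parts: int) -> np.ndarray:
    """Assign each part ID an evenly spaced hue (assumption; the paper's
    exact color scheme is not specified). Returns (num_parts, 3) floats."""
    return np.array(
        [colorsys.hsv_to_rgb(i / num_parts, 1.0, 1.0) for i in range(num_parts)],
        dtype=np.float32,
    )

def encode_parts(mask: np.ndarray, num_parts: int) -> np.ndarray:
    """Map an integer part-ID mask (H, W) to an RGB-like image (H, W, 3)."""
    return part_palette(num_parts)[mask]

def decode_parts(rgb: np.ndarray, num_parts: int) -> np.ndarray:
    """Recover the part-ID mask by nearest-palette-color lookup, which is
    tolerant to small perturbations in the generated colors."""
    dists = np.linalg.norm(rgb[..., None, :] - part_palette(num_parts), axis=-1)
    return dists.argmin(axis=-1)  # (H, W)

# Round-trip check on a toy mask with 5 parts.
mask = np.array([[0, 1, 2], [3, 4, 0]])
rgb = encode_parts(mask, num_parts=5)
assert (decode_parts(rgb, num_parts=5) == mask).all()
```

Because each part maps to a continuous color, the segmentation branch can reuse the RGB branch's latent VAE unchanged, and any part count fits the same three-channel representation.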
Related papers
- 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere [77.83037497484366]
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics.
arXiv Detail & Related papers (2026-02-10T18:57:04Z)
- Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation [76.21162972133534]
We represent a decomposed 4D scene with Freetime FeatureGS. We design a streaming feature learning strategy to accurately recover it from per-image segmentation maps. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
arXiv Detail & Related papers (2025-12-28T02:37:12Z)
- SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting [50.69165364520998]
We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages a dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender datasets, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames, and high-fidelity 4D reconstruction reaching 26.3 PSNR.
arXiv Detail & Related papers (2025-12-03T23:05:01Z)
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer [21.55368174087611]
We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. Our method achieves superior editing fidelity and both multi-view and temporal consistency compared to prior approaches.
arXiv Detail & Related papers (2025-11-30T00:18:46Z)
- One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control [15.085082024859142]
One4D is a unified framework for 4D generation and reconstruction. It produces dynamic 4D content as synchronized RGB frames and pointmaps. One4D is trained on a mixture of synthetic and real 4D datasets under modest computational budgets.
arXiv Detail & Related papers (2025-11-24T09:31:23Z)
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z)
- 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation [23.361360623083943]
We present 4DVD, a video diffusion model that generates 4D content in a decoupled manner. To train 4DVD, we collect a dynamic 3D dataset called D-averse from a benchmark. Experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation.
arXiv Detail & Related papers (2025-08-06T14:08:36Z)
- PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers [29.52313100024294]
We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. PartCrafter simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes.
arXiv Detail & Related papers (2025-06-05T20:30:28Z)
- In-2-4D: Inbetweening from Two Single-View Images to 4D Generation [54.62824686338408]
We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D.
arXiv Detail & Related papers (2025-04-11T09:01:09Z)
- Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatiotemporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z)
- 3D Part Segmentation via Geometric Aggregation of 2D Visual Features [57.20161517451834]
Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. To address these limitations, we propose COPS, a COmprehensive model for Parts that blends semantics extracted from visual concepts and 3D geometry to effectively identify object parts.
arXiv Detail & Related papers (2024-12-05T15:27:58Z)
- Semantic Dense Reconstruction with Consistent Scene Segments [33.0310121044956]
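The core idea behind geometric aggregation of 2D visual features can be sketched in a few lines. COPS's actual pipeline is more involved (multi-view rendering, VLM features, careful visibility handling); the toy projection setup and function below are assumptions that show only the basic mechanism of projecting 3D points into each view and averaging the sampled per-pixel features:

```python
import numpy as np

def aggregate_features(points, projections, feature_maps):
    """Lift per-view 2D features onto 3D points (illustrative sketch).

    points: (N, 3) 3D points; projections: list of (3, 4) camera matrices;
    feature_maps: list of (H, W, C) per-view feature images.
    Returns (N, C): per-point features averaged over the views that see each point.
    """
    n, c = len(points), feature_maps[0].shape[-1]
    acc = np.zeros((n, c), dtype=np.float64)
    count = np.zeros((n, 1), dtype=np.float64)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)
    for P, fmap in zip(projections, feature_maps):
        h, w, _ = fmap.shape
        uvw = homog @ P.T                                      # (N, 3)
        z = uvw[:, 2]
        safe_z = np.where(z != 0, z, 1.0)
        u = np.round(uvw[:, 0] / safe_z).astype(int)           # pixel column
        v = np.round(uvw[:, 1] / safe_z).astype(int)           # pixel row
        visible = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[visible] += fmap[v[visible], u[visible]]           # sample nearest pixel
        count[visible] += 1.0
    return acc / np.maximum(count, 1.0)

# Toy check: one view with an identity-like projection, one 1-channel feature map.
P = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])
fmap = np.arange(16, dtype=np.float64).reshape(4, 4, 1)  # fmap[v, u] = 4*v + u
pts = np.array([[1.0, 2.0, 0.0]])                        # projects to (u=1, v=2)
assert aggregate_features(pts, [P], [fmap])[0, 0] == 9.0
```

Averaging only over views where a point projects inside the image keeps points unseen by a camera from being polluted by out-of-frame samples.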
A method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks.
First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone.
A dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence.
arXiv Detail & Related papers (2021-09-30T03:01:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.