WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
- URL: http://arxiv.org/abs/2509.15130v2
- Date: Sat, 27 Sep 2025 14:42:58 GMT
- Title: WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
- Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
- Abstract summary: WorldForge is a training-free, inference-time framework composed of three tightly coupled modules. Our framework is plug-and-play and model-agnostic, enabling broad applicability across various 3D/4D tasks.
- Score: 17.295532380360992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent video diffusion models show immense potential for spatial intelligence tasks due to their rich world priors, but this potential is undermined by limited controllability, poor spatial-temporal consistency, and entangled scene-camera dynamics. Existing solutions, such as model fine-tuning and warping-based repainting, struggle with scalability, generalization, and robustness against artifacts. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. 1) Intra-Step Recursive Refinement injects fine-grained trajectory guidance at denoising steps through a recursive correction loop, ensuring motion remains aligned with the target path. 2) Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. 3) Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Our framework is plug-and-play and model-agnostic, enabling broad applicability across various 3D/4D tasks. Extensive experiments demonstrate that our method achieves state-of-the-art performance in trajectory adherence, geometric consistency, and perceptual quality, outperforming both training-intensive and inference-only baselines.
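The guidance mechanism described in the abstract can be pictured with a short, hypothetical sketch. The snippet below is not the authors' implementation; `denoise_fn`, `warp_to_trajectory`, and `flow_gate` are placeholder callables standing in for the diffusion model's denoiser, a trajectory-aligned warping operator, and an optical-flow-based channel gate. It only illustrates how an intra-step recursive correction, flow-gated injection, and a dual-path comparison could be combined at inference time.

```python
# A minimal, hypothetical sketch of the training-free guidance idea described in the
# abstract (not the released WorldForge code). denoise_fn, warp_to_trajectory, and
# flow_gate are assumed placeholder callables.
import torch

def guided_denoise_step(z_t, t, denoise_fn, warp_to_trajectory, flow_gate,
                        n_recursions: int = 2, correction_scale: float = 0.5):
    """One illustrative guided denoising step."""
    # Unguided path: the video diffusion model's own prediction.
    z_unguided = denoise_fn(z_t, t)

    # Guided path: recursively pull the latent toward a trajectory-aligned
    # reference before re-denoising ("intra-step recursive refinement").
    z_guided = z_t
    for _ in range(n_recursions):
        z_ref = warp_to_trajectory(z_guided, t)          # trajectory-aligned reference latent
        gate = flow_gate(z_guided, z_ref)                # in [0, 1]; ~1 on motion-related channels
        z_guided = z_guided + gate * (z_ref - z_guided)  # inject guidance only where gated
        z_guided = denoise_fn(z_guided, t)

    # Dual-path self-correction: trust the guidance only where the guided and
    # unguided paths agree, damping drift from noisy or misaligned structural signals.
    disagreement = (z_guided - z_unguided).abs()
    trust = torch.exp(-disagreement)                     # high trust where the paths agree
    return z_unguided + correction_scale * trust * (z_guided - z_unguided)
```

The design choice mirrored here is that trajectory guidance is applied only to gated (motion-related) channels and only to the extent the two denoising paths agree, so noise in the structural signal cannot dominate the generated content.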
Related papers
- Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training [27.251232052868033]
Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation.
arXiv Detail & Related papers (2025-09-08T14:21:45Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation [59.33052312107478]
Event cameras offer possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models non-uniform trajectories via event-guided parametric curves. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, flows and depth motion fields.
arXiv Detail & Related papers (2025-03-14T13:15:54Z) - Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame interpolation methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions. Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes. We propose Motion-aware Generative frame interpolation (MoG), which synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z) - Physical Informed Driving World Model [47.04423342994622]
DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles. We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.
arXiv Detail & Related papers (2024-12-11T14:29:35Z) - Driving View Synthesis on Free-form Trajectories with Generative Prior [39.24591650300784]
DriveX is a novel free-form driving view synthesis framework. It distills a generative prior into the 3D Gaussian model during its optimization, and achieves high-quality view synthesis beyond recorded trajectories in real time.
arXiv Detail & Related papers (2024-12-02T17:07:53Z) - DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in camera pose estimation, even in complex and challenging dynamic scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z) - FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models [41.006754386910686]
We argue that the diffusion model itself allows decent control over the generated content without requiring any training.
We introduce a tuning-free framework to achieve trajectory-controllable video generation, by imposing guidance on both noise construction and attention computation.
arXiv Detail & Related papers (2024-06-24T17:59:56Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion [83.88829943619656]
We introduce a method for generating realistic pedestrian trajectories and full-body animations that can be controlled to meet user-defined goals.
Our guided diffusion model allows users to constrain trajectories through target waypoints, speed, and specified social groups.
We propose utilizing the value function learned during RL training of the animation controller to guide diffusion to produce trajectories better suited for particular scenarios (a minimal illustrative sketch of this value-guided sampling idea appears after this list).
arXiv Detail & Related papers (2023-04-04T15:46:42Z) - An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds [50.19288542498838]
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving.
Current approaches all follow the Siamese paradigm based on appearance matching.
We introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective.
arXiv Detail & Related papers (2023-03-21T17:28:44Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
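As a complement to the Trace and Pace entry above, the following is a minimal, assumed sketch of value-guided diffusion sampling: the gradient of a learned value function with respect to the noisy sample nudges each denoising step toward trajectories the RL-trained controller scores highly. `denoise_fn` and `value_fn` are hypothetical placeholders, not the paper's API.

```python
# A minimal sketch (assumed, not from the Trace and Pace paper) of value-guided
# diffusion sampling. denoise_fn and value_fn are hypothetical placeholders.
import torch

def value_guided_step(x_t, t, denoise_fn, value_fn, guidance_scale: float = 1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoise_fn(x_t, t)               # model's estimate of the clean trajectory
    value = value_fn(x0_pred).sum()            # scalar score from the learned value function
    grad = torch.autograd.grad(value, x_t)[0]  # ascend the value w.r.t. the noisy sample
    return (x0_pred + guidance_scale * grad).detach()
```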