Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training
- URL: http://arxiv.org/abs/2509.06723v2
- Date: Thu, 11 Sep 2025 12:56:31 GMT
- Title: Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training
- Authors: Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li
- Abstract summary: Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation.
- Score: 27.251232052868033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and by misaligning the manipulated latents with the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations. First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
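To make the 3D-Aware Kinematic Projection concrete, here is a minimal sketch (not the authors' code) of how inferred depth can turn a 2D drag into a perspective-correct affine warp: under a pinhole camera model, apparent size is inversely proportional to depth, so a region dragged toward the camera should scale by the ratio of source to target depth. The depth map, coordinates, and scale rule below are illustrative assumptions.

```python
# Hypothetical sketch of a depth-aware affine warp for one trajectory step.
import numpy as np

def perspective_affine(src_xy, dst_xy, depth):
    """Build a 2x3 affine matrix moving a region centered at src_xy to
    dst_xy, scaled by the ratio of inferred depths (pinhole assumption)."""
    z_src = depth[src_xy[1], src_xy[0]]
    z_dst = depth[dst_xy[1], dst_xy[0]]
    s = z_src / max(z_dst, 1e-6)      # grow when moving closer to camera
    tx = dst_xy[0] - s * src_xy[0]    # translation applied after scaling
    ty = dst_xy[1] - s * src_xy[1]
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]], dtype=np.float32)

# Usage with a stand-in depth map; the resulting matrix could be applied
# to each latent channel of the target region, e.g. via cv2.warpAffine.
depth = np.random.rand(64, 64).astype(np.float32)
A = perspective_affine((10, 20), (30, 25), depth)
```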
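The Trajectory-Guided Test-Time LoRA mechanism can be sketched under heavy simplifications: a toy MLP stands in for the diffusion backbone, a low-rank adapter is injected into one of its linear layers, and the adapter is optimized jointly with the manipulated latent against a placeholder regional feature consistency loss, then discarded. The rank, learning rate, step count, and loss target are assumptions, not the paper's settings.

```python
# Hedged sketch of ephemeral test-time LoRA co-adaptation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained weights
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank) * 1e-3)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()  # low-rank delta

denoiser = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64))
denoiser[0] = LoRALinear(denoiser[0])            # inject ephemeral adapter
lora_params = [denoiser[0].A, denoiser[0].B]

latent = torch.randn(1, 64, requires_grad=True)  # manipulated latent state
target_feat = torch.randn(1, 64)                 # stand-in source-region feature
opt = torch.optim.Adam(lora_params + [latent], lr=1e-2)

for _ in range(10):                              # a few inner TTT steps
    opt.zero_grad()
    loss = (denoiser(latent) - target_feat).pow(2).mean()  # region consistency
    loss.backward()
    opt.step()
```

Optimizing the adapter and the latent together is what lets the network's predictions stay consistent with the warped latent instead of fighting it; after the denoising step, the adapter would simply be dropped.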
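Guidance Field Rectification's one-step lookahead can likewise be sketched under simple DDIM-style assumptions: a small correction to the predicted noise is optimized so that the closed-form one-step estimate of the clean latent moves the target region along the trajectory. The schedule constant `alpha_bar` and the region objective below are placeholders.

```python
# Hypothetical one-step-lookahead rectification of the guidance field.
import torch

def rectify_guidance(x_t, eps_pred, alpha_bar, region_loss, steps=5, lr=0.1):
    g = torch.zeros_like(eps_pred, requires_grad=True)  # guidance correction
    opt = torch.optim.SGD([g], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        eps = eps_pred + g
        # One-step lookahead: closed-form estimate of the clean latent x0.
        x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
        loss = region_loss(x0_hat)   # how far the region is off-trajectory
        loss.backward()
        opt.step()
    return (eps_pred + g).detach()

# Usage with toy tensors and a stand-in trajectory objective.
x_t = torch.randn(1, 4, 8, 8)
eps_pred = torch.randn(1, 4, 8, 8)
alpha_bar = torch.tensor(0.5)
loss_fn = lambda x0: x0[..., :4, :4].mean()
eps_rect = rectify_guidance(x_t, eps_pred, alpha_bar, loss_fn)
```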
Related papers
- From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting [26.57713792657793]
We propose a motion-adaptive framework that aligns control density with motion complexity. We show significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
arXiv Detail & Related papers (2025-10-03T05:33:58Z) - DiTraj: training-free trajectory control for video diffusion transformer [34.05715460730871]
Trajectory control represents a user-friendly task in controllable video generation. We propose DiTraj, a training-free framework for trajectory control in text-to-video generation tailored for DiT. Our method outperforms previous methods in both video quality and trajectory controllability.
arXiv Detail & Related papers (2025-09-26T03:53:31Z) - WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance [17.295532380360992]
WorldForge is a training-free, inference-time framework composed of three tightly coupled modules. Our framework is plug-and-play and model-agnostic, enabling broad applicability across various 3D/4D tasks.
arXiv Detail & Related papers (2025-09-18T16:40:47Z) - T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates [29.598249500198904]
Trajectory-Guided Generative Video Coding (dubbed T-GVC) bridges low-level motion tracking with high-level semantic understanding. Our framework achieves more precise motion control than existing text-guided methods.
arXiv Detail & Related papers (2025-07-10T11:01:58Z) - GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control [50.67481583744243]
We introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models. We propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Our method significantly outperforms existing models in both action accuracy and 3D spatial awareness.
arXiv Detail & Related papers (2025-05-28T14:46:51Z) - EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation [59.33052312107478]
Event cameras offer possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models motion trajectories via event-guided non-uniform parametric curves. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, flows and depth motion fields.
arXiv Detail & Related papers (2025-03-14T13:15:54Z) - An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and an image-assisted training strategy. Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
arXiv Detail & Related papers (2024-12-18T12:10:33Z) - Driving View Synthesis on Free-form Trajectories with Generative Prior [39.24591650300784]
DriveX is a novel free-form driving view synthesis framework. It distills a generative prior into the 3D Gaussian model during its optimization. It achieves high-quality view synthesis beyond recorded trajectories in real time.
arXiv Detail & Related papers (2024-12-02T17:07:53Z) - T-3DGS: Removing Transient Objects for 3D Scene Reconstruction [83.05271859398779]
Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. We propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting.
arXiv Detail & Related papers (2024-11-29T07:45:24Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture, named ALOcc, achieves an optimal trade-off between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds [50.19288542498838]
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving.
Current approaches all follow the Siamese paradigm based on appearance matching.
We introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective.
arXiv Detail & Related papers (2023-03-21T17:28:44Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)