TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- URL: http://arxiv.org/abs/2510.15104v1
- Date: Thu, 16 Oct 2025 19:45:27 GMT
- Title: TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- Authors: Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Bo Liu, Yiding Yang, Guang Chen, Longyin Wen, Alan Yuille, Chongyang Ma
- Abstract summary: We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches.
- Score: 33.989722489622075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
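The abstract names two mechanisms, Location-Aware Cross-Attention (LACA) and a dual-CFG scheme, without giving their formulas. Below is a minimal PyTorch sketch of one plausible realization, assuming an InstructPix2Pix-style composition of the two guidance terms and a Gaussian distance bias for location awareness; the function names (`dual_cfg`, `laca_bias`) and all numeric defaults are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def dual_cfg(eps_uncond, eps_global, eps_full, s_global=7.5, s_local=3.0):
    """Combine three denoiser predictions with separate guidance scales.

    eps_uncond: prediction with all text conditioning dropped
    eps_global: prediction with only the global prompt
    eps_full:   prediction with the global prompt plus the per-trajectory
                localized descriptions
    The InstructPix2Pix-style composition below is an assumption; the
    paper only states that local and global text guidance are modulated
    separately.
    """
    return (eps_uncond
            + s_global * (eps_global - eps_uncond)
            + s_local * (eps_full - eps_global))

def laca_bias(grid_hw, traj_xy, sigma=0.1):
    """Hypothetical 'location-aware' bias for LACA-style cross-attention.

    Builds a (H*W, T) bias to add to cross-attention logits so that a
    spatial query attends more strongly to the text tokens of trajectories
    whose current point lies nearby. traj_xy holds T trajectory points in
    normalized [0, 1] coordinates for the current frame.
    """
    H, W = grid_hw
    ys = torch.linspace(0.0, 1.0, H)
    xs = torch.linspace(0.0, 1.0, W)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    pos = torch.stack([xx, yy], dim=-1).reshape(-1, 2)           # (H*W, 2)
    d2 = ((pos[:, None, :] - traj_xy[None, :, :]) ** 2).sum(-1)  # (H*W, T)
    return -d2 / (2.0 * sigma**2)  # Gaussian falloff in logit space
```

With `s_local=0` this composition collapses to ordinary single-prompt CFG, consistent with the abstract's framing of local and global text guidance as independently modulated.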
Related papers
- LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents [61.91651123290512]
LangDriveCTRL is a framework for editing real-world driving videos to synthesize diverse traffic scenarios. It supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction.
arXiv Detail & Related papers (2025-12-19T10:57:03Z)
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation [102.55214293086863]
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios. We introduce Test-Time Optimization and Memorization (TTOM) to align VFMs with video layouts during inference for better text-image alignment. We find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization.
arXiv Detail & Related papers (2025-10-09T08:37:00Z)
- DiTraj: training-free trajectory control for video diffusion transformer [34.05715460730871]
Trajectory control represents a user-friendly task in controllable video generation. We propose DiTraj, a training-free framework for trajectory control in text-to-video generation tailored for DiT. Our method outperforms previous methods in both video quality and trajectory controllability.
arXiv Detail & Related papers (2025-09-26T03:53:31Z)
- Versatile Transition Generation with Image-to-Video Diffusion [89.67070538399457]
We present a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. We show that VTG achieves superior transition performance consistently across all four tasks.
arXiv Detail & Related papers (2025-08-03T10:03:56Z)
- LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs [3.6016438645365834]
We present a framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate pre-trained diffusion models. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions, and bind object attributes.
arXiv Detail & Related papers (2025-07-26T12:57:02Z)
- DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation [14.34140569565309]
DyST-XL is a framework that enhances off-the-shelf text-to-video models through frame-aware control. The code is released at https://github.com/XiaoBuL/DyST-XL.
arXiv Detail & Related papers (2025-04-21T11:41:22Z)
- FonTS: Text Rendering with Typography and Style Controls [12.717568891224074]
This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method with enclosing typography control tokens (ETC-tokens). To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency.
arXiv Detail & Related papers (2024-11-28T16:19:37Z)
- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models [84.71887272654865]
We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
arXiv Detail & Related papers (2023-11-28T16:33:08Z)
- End-to-End Video Text Spotting with Transformer [86.46724646835627]
We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition).
arXiv Detail & Related papers (2022-03-20T12:14:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.