LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
- URL: http://arxiv.org/abs/2512.17445v1
- Date: Fri, 19 Dec 2025 10:57:03 GMT
- Title: LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
- Authors: Yun He, Francesco Pittaluga, Ziyu Jiang, Matthias Zwicker, Manmohan Chandraker, Zaid Tasneem
- Abstract summary: LangDriveCTRL is a framework for editing real-world driving videos to synthesize diverse traffic scenarios. It supports both object node editing (removal, insertion, and replacement) and multi-object behavior editing from a single natural-language instruction.
- Score: 61.91651123290512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: https://yunhe24.github.io/langdrivectrl/.
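The agentic pipeline described in the abstract (Orchestrator parses the instruction, an Object Grounding Agent matches the text to scene-graph nodes, and the edit is applied to the graph) can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: all class and function names (`SceneNode`, `SceneGraph`, `orchestrate`) are assumptions, and real grounding would use multi-modal matching rather than keyword lookup.

```python
# Hypothetical sketch of a LangDriveCTRL-style edit: instruction -> grounding -> graph edit.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """A dynamic object (or static background) from the 3D scene decomposition."""
    node_id: str
    category: str                                    # e.g. "car", "pedestrian"
    trajectory: list = field(default_factory=list)   # per-frame poses

@dataclass
class SceneGraph:
    nodes: dict  # node_id -> SceneNode

    def remove(self, node_id: str) -> None:
        self.nodes.pop(node_id, None)

def orchestrate(instruction: str, graph: SceneGraph) -> SceneGraph:
    """Toy orchestrator handling only removal instructions.
    Grounding here is naive substring matching of the object category."""
    if instruction.startswith("remove the "):
        description = instruction[len("remove the "):]
        targets = [nid for nid, node in graph.nodes.items()
                   if node.category in description]
        for nid in targets:
            graph.remove(nid)
    return graph

graph = SceneGraph(nodes={
    "obj1": SceneNode("obj1", "car"),
    "obj2": SceneNode("obj2", "pedestrian"),
})
edited = orchestrate("remove the car ahead", graph)
print(sorted(edited.nodes))  # → ['obj2']
```

In the actual system, the edited scene graph would then be rendered and passed through a video diffusion tool to clean up insertion artifacts; this sketch covers only the graph-editing step.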
Related papers
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization [49.945233586949286]
LoVoRA is a novel framework for mask-free video object removal and addition. Our approach integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. LoVoRA achieves end-to-end video editing without requiring external control signals during inference.
arXiv Detail & Related papers (2025-12-02T17:01:07Z) - TGT: Text-Grounded Trajectories for Locally Controlled Video Generation [33.989722489622075]
We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches.
arXiv Detail & Related papers (2025-10-16T19:45:27Z) - InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing [6.95116998047811]
InstructUDrag is a diffusion-based framework that combines text instructions with object dragging. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
arXiv Detail & Related papers (2025-10-09T13:06:49Z) - Neural Atlas Graphs for Dynamic Scene Decomposition and Editing [32.587200006985015]
We propose a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas. NAGs achieve state-of-the-art quantitative results on the Open dataset.
arXiv Detail & Related papers (2025-09-19T18:24:41Z) - DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes [49.23098808629567]
DrivingGaussian++ is an efficient framework for realistic reconstruction and controllable editing of autonomous driving scenes. It supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation. Our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process.
arXiv Detail & Related papers (2025-08-28T16:22:54Z) - DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes [23.215760822443194]
DriveEditor is a diffusion-based framework for object editing in driving videos. It offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion.
arXiv Detail & Related papers (2024-12-27T04:49:36Z) - EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing [114.14164860467227]
We propose EditRoom, a framework capable of executing a variety of layout edits through natural language commands. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes. We have developed an automatic pipeline to augment existing 3D scene datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs.
arXiv Detail & Related papers (2024-10-03T17:42:24Z) - MotionEditor: Editing Video Motion via Content-Aware Diffusion [96.825431998349]
MotionEditor is a diffusion model for video motion editing.
It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z) - Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
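The scene-graph representation recurring in these papers (objects as nodes, pairwise relationships as labeled edges) can be illustrated with a minimal data structure. This is a generic sketch, not the representation of any specific paper above: in Graph-to-3D, for instance, nodes additionally carry shape and layout attributes.

```python
# Minimal scene graph: a set of object nodes plus labeled relationship edges.
from collections import defaultdict

class SceneGraph:
    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(list)   # subject -> [(relation, object)]

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        """Register both objects as nodes and record the directed relation."""
        self.nodes.update([subj, obj])
        self.edges[subj].append((rel, obj))

    def relations(self):
        """Flatten edges into (subject, relation, object) triples."""
        return [(s, r, o) for s, ros in self.edges.items() for r, o in ros]

g = SceneGraph()
g.add_relation("lamp", "standing on", "table")
g.add_relation("table", "left of", "sofa")
print(sorted(g.nodes))  # → ['lamp', 'sofa', 'table']
print(g.relations())    # → [('lamp', 'standing on', 'table'), ('table', 'left of', 'sofa')]
```

Generative approaches like Graph-to-3D then learn to map such triples directly to 3D shapes and layouts, so editing an edge or node in the graph induces a corresponding edit in the synthesized scene.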
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.