GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation
- URL: http://arxiv.org/abs/2602.20673v1
- Date: Tue, 24 Feb 2026 08:22:42 GMT
- Title: GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation
- Authors: Hao Zhang, Lue Fan, Qitai Wang, Wenbo Li, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li
- Abstract summary: We present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories. GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model.
- Score: 62.07995406671134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.
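The abstract describes a two-stage pipeline: scene geometry is first re-projected into the user-specified novel camera to form pseudo-views, and a trained video diffusion model then restores photorealistic appearance. Below is a minimal sketch of that decoupling, assuming the scene geometry is available as a colored point cloud and treating the diffusion model as a black box; `render_pseudo_view`, `refine`, and all shapes are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of geometry-appearance decoupling (not the authors' code).
import numpy as np

def render_pseudo_view(points_world, K, T_world_to_cam, hw=(256, 512)):
    """Stage 1: splat a colored point cloud (N x 6: xyz + rgb) into a novel
    camera pose, yielding a sparse, artifact-prone pseudo-view."""
    h, w = hw
    pts = np.concatenate([points_world[:, :3], np.ones((len(points_world), 1))], axis=1)
    cam = (T_world_to_cam @ pts.T).T[:, :3]   # world -> camera frame
    front = cam[:, 2] > 1e-3                  # keep points in front of the camera
    cam, rgb = cam[front], points_world[front, 3:6]
    uvz = (K @ cam.T).T                       # pinhole projection
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, rgb = u[ok], v[ok], cam[ok, 2], rgb[ok]
    img = np.zeros((h, w, 3), dtype=np.float32)
    order = np.argsort(-z)                    # draw far-to-near so near points win
    img[v[order], u[order]] = rgb[order]
    return img

def refine(pseudo_views, video_diffusion):
    """Stage 2: a trained video diffusion model (a black-box callable here)
    maps geometry-faithful pseudo-views to photorealistic frames."""
    return video_diffusion(np.stack(pseudo_views))
```

The decoupling is visible in the interface: appearance edits only touch the `rgb` channels (or the diffusion conditioning), while the projection geometry stays fixed, which is what allows edits to remain consistent across both the original and novel trajectories.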
Related papers
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles [63.88996084630768]
Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes. Experiments show that the Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations.
arXiv Detail & Related papers (2026-02-24T20:03:47Z)
- Visual Implicit Geometry Transformer for Autonomous Driving [7.795200422563638]
We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving. ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV) representation, addressing domain-specific requirements. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets.
arXiv Detail & Related papers (2026-02-05T11:54:38Z)
- DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations. We propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z)
- Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry [41.904066758259624]
We introduce Vid-CamEdit, a novel framework for video camera trajectory editing. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry.
arXiv Detail & Related papers (2025-06-16T17:02:47Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions (a sketch of this idea follows the list below). CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
- View-Centric Multi-Object Tracking with Homographic Matching in Moving UAV [43.37259596065606]
We address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios.
Changes in the scene background not only render traditional frame-to-frame object IOU association methods ineffective but also introduce significant view shifts in the objects.
We propose HomView-MOT, a novel universal framework that, for the first time, harnesses the view homography inherent in changing scenes to solve MOT challenges (a sketch of homography-compensated association also follows the list below).
arXiv Detail & Related papers (2024-03-16T06:48:33Z)
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
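For the CogDriving entry above, one plausible reading of "holistic-4D attention" is joint attention over a flattened view x time x height x width token axis; the class name, shapes, and module choice below are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical illustration of "holistic-4D attention": flatten view, time,
# height, and width into one token axis so a single attention call can
# associate across all four dimensions at once. Not the CogDriving code.
import torch
import torch.nn as nn

class Holistic4DAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, time, height, width, dim)
        b, v, t, h, w, d = x.shape
        tokens = x.reshape(b, v * t * h * w, d)  # one joint 4D token sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, v, t, h, w, d)

# Smoke test on a small latent grid: 6 views, 4 frames, 8x16 latents.
x = torch.randn(1, 6, 4, 8, 16, 32)
assert Holistic4DAttention(dim=32)(x).shape == x.shape
```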
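For the HomView-MOT entry, the summary gestures at a classical idea: compensate camera motion with a frame-to-frame homography before the usual IOU association. The sketch below is a hedged, generic version of that idea (helper names are hypothetical; the paper's actual matching pipeline is not given in the summary).

```python
# Generic homography-compensated IOU association (illustrative, not HomView-MOT).
import numpy as np

def warp_boxes(boxes: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Warp axis-aligned boxes (N x 4, x1 y1 x2 y2) by a 3x3 homography:
    warp the four corners, then re-take the axis-aligned extent."""
    x1, y1, x2, y2 = boxes.T
    corners = np.stack([np.stack([x1, y1], 1), np.stack([x2, y1], 1),
                        np.stack([x1, y2], 1), np.stack([x2, y2], 1)], axis=1)  # N x 4 x 2
    ones = np.ones((*corners.shape[:2], 1))
    warped = np.concatenate([corners, ones], axis=2) @ H.T  # homogeneous coords
    warped = warped[..., :2] / warped[..., 2:3]
    return np.concatenate([warped.min(axis=1), warped.max(axis=1)], axis=1)

def pairwise_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IOU matrix (M x N) between two box sets; association (e.g. Hungarian
    matching) would run on warp_boxes(prev, H) vs. current detections."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(-1)
    area = lambda x: (x[:, 2] - x[:, 0]) * (x[:, 3] - x[:, 1])
    return inter / (area(a)[:, None] + area(b)[None, :] - inter)
```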
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.