RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation
- URL: http://arxiv.org/abs/2509.16500v1
- Date: Sat, 20 Sep 2025 02:23:36 GMT
- Title: RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation
- Authors: Tianyi Yan, Wencheng Han, Xia Zhou, Xueyang Zhang, Kun Zhan, Cheng-zhong Xu, Jianbing Shen,
- Abstract summary: We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data.<n>We introduce Reinforcement Learning with Geometric Feedback (RLGF), RLGF uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models.<n> RLGF substantially reduces geometric errors (e.g., VP error by 21%, Depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance.
- Score: 75.61028930882144
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF), RLGF uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components include an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion, and a Hierarchical Geometric Reward (HGR) system providing multi-level rewards for point-line-plane alignment, and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21\%, Depth error by 57\%) and dramatically improves 3D object detection mAP by 12.7\%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
Related papers
- ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting [63.138778159026934]
We propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO.<n> ERGO dynamically estimates the view-specific excess risk and adaptively adjust loss weights during optimization.<n>Experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
arXiv Detail & Related papers (2026-02-10T20:44:43Z) - Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams.<n>We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision.<n>Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
arXiv Detail & Related papers (2026-02-08T09:53:21Z) - VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation [34.46015478321541]
VideoGPA is a data-efficient self-supervised framework to automatically derive dense preference signals.<n>It steers the generative distribution toward inherent 3D consistency without requiring human annotations.<n>It significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs.
arXiv Detail & Related papers (2026-01-30T18:59:57Z) - UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating [20.395441875640955]
We propose a unified 3D indoor detection framework called UniGeo.<n>To model geometric relations in scenes, we first propose a geometry-aware learning module that establishes a learnable mapping from spatial relationships to feature weights.<n>Then, to further enhance point cloud feature representation, we propose a dynamic channel gating mechanism that leverages learnable channel-wise weighting.
arXiv Detail & Related papers (2026-01-30T06:15:50Z) - EGG-Fusion: Efficient 3D Reconstruction with Geometry-aware Gaussian Surfel on the Fly [8.803716785929936]
EGG-Fusion is a novel differentiable-rendering-based real-time reconstruction system.<n>The proposed system achieves a surface reconstruction error of 0.6textitcm, representing over 20% improvement in accuracy compared to state-of-the-art methods.<n> Notably, the system maintains real-time processing capabilities at 24 FPS, establishing it as one of the most accurate differentiable-rendering-based real-time reconstruction systems.
arXiv Detail & Related papers (2025-12-01T05:32:17Z) - Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting [95.61137026932062]
Intern-GS is a novel approach to enhance the process of sparse-view Gaussian splatting.<n>We show that Intern-GS achieves state-of-the-art rendering quality across diverse datasets.
arXiv Detail & Related papers (2025-05-27T05:17:49Z) - Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis [22.767866875051013]
We propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model to address limitations of current methods.<n>By generating pseudo-observations conditioned on multi-view inputs, our method transforms under-constrained 3D reconstruction problems into well-posed ones.<n> Experiments on four benchmarks demonstrate that GS-Diff consistently outperforms state-of-the-art baselines by significant margins.
arXiv Detail & Related papers (2025-04-02T17:59:46Z) - MonoGSDF: Exploring Monocular Geometric Cues for Gaussian Splatting-Guided Implicit Surface Reconstruction [84.07233691641193]
We introduce MonoGSDF, a novel method that couples primitives with a neural Signed Distance Field (SDF) for high-quality reconstruction.<n>To handle arbitrary-scale scenes, we propose a scaling strategy for robust generalization.<n>Experiments on real-world datasets outperforms prior methods while maintaining efficiency.
arXiv Detail & Related papers (2024-11-25T20:07:07Z) - GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions [22.077366472693395]
We introduce a new generative approach for synthesizing 3D geometry and images from single-view collections.
By employing volumetric rendering using neural radiance fields, they inherit a key limitation: the generated geometry is noisy and unconstrained.
We propose GeoGen, a new SDF-based 3D generative model trained in an end-to-end manner.
arXiv Detail & Related papers (2024-06-06T17:00:10Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Self-supervised Learning for Enhancing Geometrical Modeling in 3D-Aware
Generative Adversarial Network [42.16520614686877]
3D-GANs exhibit artifacts in their 3D geometrical modeling, such as mesh imperfections and holes.
These shortcomings are primarily attributed to the limited availability of annotated 3D data.
We present a Self-Supervised Learning technique tailored as an auxiliary loss for any 3D-GAN.
arXiv Detail & Related papers (2023-12-19T04:55:33Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
But, they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.