DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
- URL: http://arxiv.org/abs/2602.22549v1
- Date: Thu, 26 Feb 2026 02:42:14 GMT
- Title: DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
- Authors: Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu,
- Abstract summary: Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions for conditional scene generation. These methods suffer from insufficient details in both semantic and structural aspects. We propose DrivePTS, which incorporates three key innovations.
- Score: 8.8362637812626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
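As an illustration of the third innovation, below is a minimal PyTorch sketch of what a frequency-guided structure loss could look like: the standard denoising MSE is reweighted by a high-frequency map extracted from the clean target, so edges and thin foreground structures contribute more to the loss. The Laplacian high-pass filter, the weighting scheme, and all names here are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch of a frequency-guided structure loss in the spirit of
# DrivePTS: the uniform denoising MSE is reweighted by a high-frequency map
# so foreground structure (edges, thin objects) is emphasized.
import torch
import torch.nn.functional as F

def high_frequency_map(x: torch.Tensor) -> torch.Tensor:
    """Per-pixel high-frequency magnitude via a Laplacian filter (assumed choice).

    x: (B, C, H, W) clean target image or latent.
    Returns a (B, 1, H, W) map normalized to [0, 1].
    """
    lap = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]], device=x.device).view(1, 1, 3, 3)
    gray = x.mean(dim=1, keepdim=True)           # collapse channels
    hf = F.conv2d(gray, lap, padding=1).abs()    # high-pass response
    hf = hf / (hf.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return hf

def frequency_guided_loss(noise_pred, noise_target, x0, alpha: float = 1.0):
    """Denoising loss with extra weight on high-frequency regions.

    noise_pred / noise_target: (B, C, H, W) predicted and true noise.
    x0: clean target used to locate high-frequency structure.
    alpha: strength of the reweighting (assumed hyperparameter).
    """
    w = 1.0 + alpha * high_frequency_map(x0)     # uniform term + HF emphasis
    per_pixel = (noise_pred - noise_target) ** 2
    return (w * per_pixel).mean()
```

The weight map could equally be computed in latent space or from an FFT band split; the Laplacian is simply the cheapest stand-in for "the model's sensitivity to high-frequency elements" described in the abstract.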
Related papers
- StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation [57.06461272772509]
StdGEN++ is a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. It achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. The resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking.
arXiv Detail & Related papers (2026-01-12T15:41:27Z) - SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z) - ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States [9.721009445297716]
ArtGen is a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency. A compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships.
arXiv Detail & Related papers (2025-12-13T17:00:03Z) - Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality [52.57416398859353]
We show that causal minimality can endow latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables. These causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.
arXiv Detail & Related papers (2025-12-11T14:59:14Z) - OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields [25.81679730373062]
We propose OpenHype, a novel approach that represents scene hierarchies using a continuous hyperbolic latent space. By leveraging the properties of hyperbolic geometry, OpenHype naturally encodes multi-scale relationships. Our method outperforms state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2025-10-24T13:17:56Z) - Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition [9.411542547451193]
We propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density. With the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird's-eye view and 3D segment perspectives.
arXiv Detail & Related papers (2025-06-17T07:04:07Z) - HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment [22.960492450413497]
We propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: the Appearance-Preserving Warp Alignment Module, the Semantic Representation Module, and the Multimodal Prior-Guided Appearance Generation Module. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS.
arXiv Detail & Related papers (2025-05-26T07:55:49Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes the noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments [73.80718037070773]
We present the multi-modal Pedestrian-Focused Scene dataset, rigorously annotated in semi-structured scenes following the nuScenes format. We also propose a novel Hybrid Multi-Scale Fusion Network (HMFN) to detect pedestrians in densely populated and occluded scenarios.
arXiv Detail & Related papers (2025-02-21T09:57:53Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Semantically Adversarial Scenario Generation with Explicit Knowledge Guidance [24.09547181095033]
We introduce a method that incorporates domain knowledge explicitly into the generation process to achieve Semantically Adversarial Generation (SAG).
By imposing semantic rules on the properties of nodes and edges in the tree structure, explicit knowledge integration enables controllable generation.
Our method efficiently identifies adversarial driving scenes against different state-of-the-art 3D point cloud segmentation models.
arXiv Detail & Related papers (2021-06-08T02:51:33Z)
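Referring back to the FreSca entry above, the following is a minimal, hypothetical sketch of frequency-space scaling of a guidance signal: the noise difference is split into low- and high-frequency bands with an FFT mask, and each band is scaled independently. The radial cutoff, scale factors, and variable names are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch of frequency-space scaling in the spirit of FreSca.
import torch

def frequency_scale(delta: torch.Tensor, cutoff: float = 0.25,
                    low_scale: float = 1.0, high_scale: float = 1.2):
    """Scale low/high frequency bands of a (B, C, H, W) noise difference."""
    spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    _, _, H, W = delta.shape
    yy = torch.linspace(-1, 1, H, device=delta.device).view(H, 1)
    xx = torch.linspace(-1, 1, W, device=delta.device).view(1, W)
    radius = (yy ** 2 + xx ** 2).sqrt()               # normalized frequency radius
    low_mask = (radius <= cutoff).to(delta.dtype)     # (H, W), broadcasts over B, C
    spec = spec * (low_scale * low_mask + high_scale * (1.0 - low_mask))
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)))
    return out.real

# Assumed usage inside classifier-free guidance (names are illustrative):
# delta = noise_cond - noise_uncond
# noise = noise_uncond + guidance_scale * frequency_scale(delta)
```

Because the rescaling is applied only to the sampled guidance signal, it requires no retraining or architectural change, which matches the training-free control described in the summary.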