Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
- URL: http://arxiv.org/abs/2602.06159v2
- Date: Mon, 09 Feb 2026 11:30:59 GMT
- Title: Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
- Authors: Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou, Yizhi Zhang, Jianan He, Yanfeng Zhang, Mingwei Sun, Zengmao Wang, Zhen Dong, Xiaoxiao Long, Liqiu Meng
- Abstract summary: We present Driving with DINO (DwD), a novel framework for autonomous driving video generation. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking."
- Score: 36.98878302668877
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
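The abstract names three mechanisms without giving implementation details: Principal Subspace Projection, Random Channel Tail Drop, and causal convolutions over frame-wise features. The block below is only a minimal NumPy sketch of how such components could look, not the authors' implementation. It assumes that the principal subspace is obtained by PCA (via SVD) over feature vectors, that "tail drop" means zeroing a random number of the lowest-variance projected channels, and that the temporal aggregator can be approximated by a single 1-D kernel shared across channels. All function names and parameters are hypothetical.

```python
import numpy as np


def fit_principal_subspace(features, k):
    """Fit a k-dimensional PCA subspace to features of shape (N, C).

    Returns the feature mean (C,) and an orthonormal basis (C, k) whose
    columns are ordered by decreasing explained variance.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T


def project_to_subspace(features, mean, basis):
    """Project (N, C) features onto the k leading principal directions."""
    return (features - mean) @ basis


def random_channel_tail_drop(projected, min_keep, rng):
    """Zero a random-length tail of the projected channels (assumed behaviour).

    Channels are ordered by variance, so the tail holds the finest detail.
    Returns the masked array and the number of channels kept.
    """
    k = projected.shape[-1]
    keep = int(rng.integers(min_keep, k + 1))  # uniform in [min_keep, k]
    out = projected.copy()
    out[..., keep:] = 0.0
    return out, keep


def causal_temporal_conv(x, kernel):
    """Causal 1-D convolution over time for x of shape (T, C).

    kernel[-1] weights the current frame and kernel[0] the oldest frame,
    so the output at time t never depends on frames later than t.
    """
    taps = len(kernel)
    pad = np.zeros((taps - 1, x.shape[1]), dtype=float)
    padded = np.concatenate([pad, x.astype(float)], axis=0)
    out = np.zeros(x.shape, dtype=float)
    for t in range(x.shape[0]):
        window = padded[t:t + taps]  # frames t-taps+1 .. t
        out[t] = (kernel[:, None] * window).sum(axis=0)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for per-patch foundation-model features: 200 vectors, 16 channels.
    feats = rng.normal(size=(200, 16))
    mean, basis = fit_principal_subspace(feats, k=4)
    low_dim = project_to_subspace(feats, mean, basis)
    masked, kept = random_channel_tail_drop(low_dim, min_keep=2, rng=rng)
    # Treat the first 6 vectors as 6 consecutive frames for the temporal step.
    smoothed = causal_temporal_conv(masked[:6], np.array([0.2, 0.3, 0.5]))
    print(low_dim.shape, kept, smoothed.shape)
```

In this sketch the projection removes low-variance directions outright, while the random tail drop would be applied only during training so that a downstream model does not depend on any fixed number of channels; whether DwD uses the same ordering or sampling scheme is not stated in the abstract.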
Related papers
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z)
- StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation [57.06461272772509]
StdGEN++ is a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. It achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. The resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking.
arXiv Detail & Related papers (2026-01-12T15:41:27Z)
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution [0.8122270502556375]
Real-world image super-resolution must handle complex degradations and inherent reconstruction ambiguities. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. We propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance.
arXiv Detail & Related papers (2025-10-22T06:06:01Z)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
- VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets. We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE). Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in video DiT architecture. We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space. We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
- Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior [22.323789227447755]
Fog, low-light, and motion blur degrade image quality and pose threats to the safety of autonomous driving.
This work proposes a novel Deep Channel Prior (DCP) for degraded visual recognition.
Based on this, a novel plug-and-play Unsupervised Feature Enhancement Module (UFEM) is proposed to achieve unsupervised feature correction.
arXiv Detail & Related papers (2024-04-02T07:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.