LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
- URL: http://arxiv.org/abs/2411.12980v3
- Date: Sat, 22 Feb 2025 16:03:34 GMT
- Title: LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
- Authors: Siwen Jiao, Yangyi Fang, Baoyun Peng, Wangqun Chen, Bharadwaj Veeravalli
- Abstract summary: We introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This results in the loss of critical details and the difficulty in effectively integrating spatial and temporal information, undermining fine-grained perception and temporal coherence essential for effective decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis to focus on motion-related features, thereby boosting computational efficiency. The core of LaVida Drive consists of two modules: the \textit{Query-aware Token Selection} module and the \textit{Spatial-Temporal Token Recovery and Enhancement} module. The former dynamically selects the most relevant visual tokens based on semantic alignment with the input query, reducing the token count from high-resolution spatial input. The latter ensures smooth and coherent interactions between spatial and temporal information, preserving contextual continuity across frames. Extensive experiments on various autonomous driving question-answering benchmarks show that LaVida Drive significantly reduces visual tokens, enhances efficiency, and improves overall performance.
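The Query-aware Token Selection idea described above, picking the visual tokens most semantically aligned with the text query and discarding the rest, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of cosine similarity against a pooled query embedding, and the fixed top-k keep ratio are all illustrative assumptions; the actual module presumably uses learned cross-modal alignment.

```python
import numpy as np

def query_aware_token_select(visual_tokens, query_emb, keep_ratio=0.25):
    """Hypothetical sketch: keep the visual tokens most aligned with the query.

    visual_tokens: (N, d) array of patch embeddings from the vision encoder
    query_emb:     (d,) pooled embedding of the text query
    Returns the indices of the kept tokens, in original spatial order.
    """
    # Cosine similarity between each visual token and the query embedding
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = v @ q

    # Keep only the top-k most query-relevant tokens
    k = max(1, int(keep_ratio * len(visual_tokens)))
    top_idx = np.argsort(scores)[::-1][:k]
    return np.sort(top_idx)  # restore spatial ordering for downstream use

# Toy usage: 1024 high-resolution patch tokens reduced to 256
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 64))
query = rng.normal(size=64)
kept = query_aware_token_select(tokens, query, keep_ratio=0.25)
print(len(kept))  # 256
```

In this toy form the token count drops fourfold before the tokens ever reach the language model, which is the efficiency lever the abstract describes; the companion recovery module would then reintroduce spatial-temporal context for the discarded regions.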
Related papers
- SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs.
We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions.
With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
- RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving [10.984203470464687]
Vision-language models (VLMs) often suffer from limitations such as inadequate spatial perception and hallucination.
We propose a retrieval-augmented decision-making (RAD) framework to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes.
We fine-tune VLMs on a dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities.
arXiv Detail & Related papers (2025-03-18T03:25:57Z)
- DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models [5.858709357808136]
Visual Question Answering (VQA) models execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance.
Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles.
This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment.
arXiv Detail & Related papers (2025-03-14T10:19:24Z)
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately.
We propose the Hints of Prompt (HoP) framework, which introduces three key enhancements.
These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning.
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving.
Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner.
Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions [13.981748780317329]
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs).
This study introduces a novel accident anticipation framework for AVs, termed CRASH.
It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion.
Our model surpasses existing top baselines in critical evaluation metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).
arXiv Detail & Related papers (2024-07-25T04:12:49Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes [70.08318779492944]
We are the first to harness vanishing point (VP) priors for more effective segmentation.
Our novel, efficient network for VSS, named VPSeg, incorporates two modules that exploit static and dynamic VP priors, respectively.
arXiv Detail & Related papers (2024-01-27T01:01:58Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation [152.60469768559878]
SHIFT is the largest multi-task synthetic dataset for autonomous driving.
It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density.
Our dataset and benchmark toolkit are publicly available at www.vis.xyz/shift.
arXiv Detail & Related papers (2022-06-16T17:59:52Z)
- CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks [11.489187712465325]
An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world.
Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data.
This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation.
arXiv Detail & Related papers (2022-05-18T04:15:42Z)
- NEAT: Neural Attention Fields for End-to-End Autonomous Driving [59.60483620730437]
We present NEural ATtention fields (NEAT), a novel representation that enables efficient reasoning for imitation learning models.
NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics.
In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert.
arXiv Detail & Related papers (2021-09-09T17:55:28Z)
- Self-Supervised Pillar Motion Learning for Autonomous Driving [10.921208239968827]
We propose a learning framework that leverages free supervisory signals from point clouds and paired camera images to estimate motion purely via self-supervision.
Our model involves a point cloud based structural consistency augmented with probabilistic motion masking as well as a cross-sensor motion regularization to realize the desired self-supervision.
arXiv Detail & Related papers (2021-04-18T02:32:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.