DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
- URL: http://arxiv.org/abs/2503.11265v1
- Date: Fri, 14 Mar 2025 10:19:24 GMT
- Title: DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
- Authors: Xirui Zhou, Lianlei Shan, Xiaolin Gui
- Abstract summary: Visual Question Answering (VQA) models execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance. Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles. This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment.
- Score: 5.858709357808136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) models, which fall under the category of vision-language models, conventionally execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance. Although this approach aids in concentrating on salient features and diminishing computational burden, it incurs the loss of vital detailed information, a drawback that is particularly damaging in end-to-end autonomous driving scenarios. Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles, all of which are crucial for safe navigation. This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment, potentially escalating the risk of accidents. To tackle this problem, we put forward the Dynamic Resolution Vision Language Model (DynRsl-VLM). DynRsl-VLM incorporates a dynamic resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT). Moreover, we devise a novel image-text alignment module to replace the Q-Former, enabling simple and efficient alignment with text when dealing with dynamic resolution image inputs. Our method enhances the environmental perception capabilities of autonomous driving systems without overstepping computational constraints.
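The abstract describes the method only at a high level. As a hedged illustration, a dynamic-resolution front end could tokenize a coarse global view plus full-resolution tiles under a fixed token budget, with a single cross-attention layer standing in for the Q-Former replacement; everything below (names, sizes, and the tiling scheme) is an assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dynamic_resolution_tokens(image, patch=16, max_tokens=1024):
    """Illustrative multi-resolution tokenizer (assumed scheme, not
    DynRsl-VLM's published one): a coarse global view of the scene plus
    full-resolution tiles, capped by a global token budget so the ViT
    input stays computationally tractable."""
    assert image.shape[1] % patch == 0 and image.shape[2] % patch == 0

    def patchify(img):
        p = img.unfold(1, patch, patch).unfold(2, patch, patch)
        return p.reshape(3, -1, patch, patch).permute(1, 0, 2, 3)

    # Coarse view keeps the whole scene visible at low cost.
    coarse = F.interpolate(image[None], size=(224, 224),
                           mode="bilinear", align_corners=False)[0]
    coarse_tok, fine_tok = patchify(coarse), patchify(image)
    # Fine tiles preserve distant/small objects that downsampling erases;
    # admit them only up to the compute budget.
    budget = max(0, max_tokens - coarse_tok.shape[0])
    tokens = torch.cat([coarse_tok, fine_tok[:budget]], dim=0)
    return tokens.flatten(1)  # (n_tokens, 3 * patch * patch)

class SimpleAlign(torch.nn.Module):
    """Hypothetical stand-in for the paper's Q-Former replacement: a single
    cross-attention layer letting text embeddings attend to the
    variable-length set of dynamic-resolution image tokens."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_emb, image_tokens):
        # text_emb: (B, T, dim); image_tokens: (B, N, dim)
        out, _ = self.attn(text_emb, image_tokens, image_tokens)
        return out
```

The token budget is what keeps arbitrary input resolutions tractable for the ViT: fine tiles are admitted only until the cap is reached, while the coarse view guarantees full scene coverage.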
Related papers
- Autoregressive High-Order Finite Difference Modulo Imaging: High-Dynamic Range for Computer Vision Applications [3.4956406636452626]
High dynamic range (HDR) imaging is vital for capturing the full range of light tones in scenes, which is essential for computer vision tasks such as autonomous driving.
Standard commercial imaging systems face limitations in well-depth capacity and quantization precision, hindering their HDR capabilities.
We develop a modulo analog-to-digital approach that resets signals upon saturation, enabling estimation of pixel resets through neighboring pixel intensities.
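A minimal sketch of the modulo-reset idea, assuming an ideal sensor and a smooth scene (the unwrapping rule is a much-simplified stand-in for the paper's autoregressive high-order finite-difference model):

```python
import numpy as np

LAMBDA = 1.0  # illustrative saturation threshold; real sensors differ

def modulo_capture(scene):
    """Modulo sensor model: intensity wraps (resets) each time it saturates."""
    return np.mod(scene, LAMBDA)

def unwrap_rows(wrapped, lam=LAMBDA):
    """Naive neighbor-based unwrapping along rows. Assumes the true scene
    changes by less than lam/2 between adjacent pixels, so reset counts can
    be read off neighboring intensities."""
    diffs = np.diff(wrapped, axis=1)
    wraps = -np.round(diffs / lam)            # +1 per upward saturation event
    resets = np.cumsum(wraps, axis=1) * lam   # accumulated offset per pixel
    out = wrapped.copy()
    out[:, 1:] += resets
    return out
```

Recovery here is exact only up to a per-row multiple of LAMBDA (the first pixel's reset count is unobservable) and only while the smoothness assumption holds, which is presumably why the paper resorts to a richer higher-order model.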
arXiv Detail & Related papers (2025-04-05T16:41:15Z)
- DAMamba: Vision State Space Model with Dynamic Adaptive Scan [51.81060691414399]
State space models (SSMs) have recently garnered significant attention in computer vision.
We propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions.
Based on DAS, we propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks.
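The summary gives no implementation details; purely as a toy illustration of a data-driven scan order (the scorer, the sort, and the GRU stand-in are all assumptions, not DAMamba's design):

```python
import torch
import torch.nn as nn

class DynamicScan(nn.Module):
    """Toy data-driven scan: reorder tokens by a learned priority before a
    sequential pass (a GRU stands in for the SSM). Note that argsort is
    non-differentiable, so a real method would need a relaxation."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)                 # per-token scan priority
        self.seq = nn.GRU(dim, dim, batch_first=True)   # sequential scan stand-in

    def forward(self, tokens):                          # tokens: (B, N, dim)
        order = self.scorer(tokens).squeeze(-1).argsort(dim=1)
        idx = order.unsqueeze(-1).expand_as(tokens)
        scanned, _ = self.seq(tokens.gather(1, idx))    # scan in learned order
        return torch.zeros_like(scanned).scatter(1, idx, scanned)  # restore order
```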
arXiv Detail & Related papers (2025-02-18T08:12:47Z)
- Scalable and Explainable Verification of Image-based Neural Network Controllers for Autonomous Vehicles [3.2540854278211864]
Existing formal verification methods for image-based neural network controllers in autonomous vehicles often struggle with high-dimensional inputs, computational inefficiency, and a lack of explainability.
We propose SEVIN, a framework that leverages a variational autoencoder (VAE) to encode high-dimensional images into a lower-dimensional, explainable latent space.
Our approach also incorporates robustness verification under real-world perturbations by augmenting the dataset and retraining the VAE to capture environmental variations.
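To make the encoding step concrete, here is a minimal VAE encoder of the kind the summary describes; layer sizes and the 64x64 input are assumptions, not SEVIN's architecture:

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Minimal VAE encoder: maps high-dimensional camera images to a
    low-dimensional latent code, over which verification becomes tractable."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.logvar = nn.Linear(64 * 16 * 16, latent_dim)

    def forward(self, x):                      # x: (B, 3, 64, 64)
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sampling stays differentiable.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar
```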
arXiv Detail & Related papers (2025-01-23T16:46:45Z)
- LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement [4.534832757549232]
We introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving.
LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception.
It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis.
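A hedged sketch of what query-guided token selection could look like (the selection rule below is an assumption, not LaVida Drive's published algorithm):

```python
import torch

def select_tokens(image_tokens, question_emb, k=256):
    """Keep only the high-resolution visual tokens most relevant to the
    question, so fine detail survives while the token count stays small."""
    # image_tokens: (N, d); question_emb: (d,)
    relevance = image_tokens @ question_emb                  # (N,) similarities
    keep = relevance.topk(min(k, image_tokens.shape[0])).indices
    return image_tokens[keep]                                # (<=k, d)
```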
arXiv Detail & Related papers (2024-11-20T02:14:07Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- Exploring Latent Pathways: Enhancing the Interpretability of Autonomous Driving with a Variational Autoencoder [79.70947339175572]
A bio-inspired neural circuit policy model has emerged as an innovative control module.
We take a leap forward by integrating a variational autoencoder with the neural circuit policy controller.
In addition to the architectural shift toward a variational autoencoder, this study introduces the automatic latent perturbation tool.
arXiv Detail & Related papers (2024-04-02T09:05:47Z)
- VmambaIR: Visual State Space Model for Image Restoration [36.11385876754612]
We propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks.
VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters.
arXiv Detail & Related papers (2024-03-18T02:38:55Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network, the Semantics Consistent Transformer (SCTNet), which includes both spatial and channel attention modules.
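For concreteness, a generic squeeze-and-excitation form of channel attention is sketched below; SCTNet's exact module may differ:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Standard squeeze-and-excitation channel attention, shown only to make
    the summary's 'channel attention module' concrete."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pool
        return x * w[:, :, None, None]     # excite: reweight channels
```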
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit that adapts to the variation in the optimal number of tokens each position should attend to.
Experiments on three applications (pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer) demonstrate that DynaST achieves superior performance in local details.
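A toy rendering of per-query adaptive sparsity (the thresholding rule is an assumption; DynaST learns this behavior rather than hard-coding it):

```python
import torch

def dynamic_sparse_attention(q, k, v, max_k=8):
    """Each query attends to a *variable* number of tokens: candidates below
    the query's own mean score are dropped before the softmax."""
    # q: (Nq, d); k, v: (Nk, d) -- single head, unbatched, for clarity.
    scores = q @ k.T / q.shape[-1] ** 0.5            # (Nq, Nk)
    topv, topi = scores.topk(max_k, dim=-1)          # candidate set per query
    topv = topv.masked_fill(topv < topv.mean(-1, keepdim=True), float("-inf"))
    attn = torch.softmax(topv, dim=-1)               # sparse weights
    return (attn.unsqueeze(-1) * v[topi]).sum(dim=1) # (Nq, d)
```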
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Reason induced visual attention for explainable autonomous driving [2.090380922731455]
Deep learning (DL) based computer vision (CV) models are generally considered black boxes due to their poor interpretability.
This study is motivated by the need to enhance the interpretability of DL models in autonomous driving.
The proposed framework imitates the learning process of human drivers by jointly modeling the visual input (images) and natural language.
arXiv Detail & Related papers (2021-10-11T18:50:41Z)