Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression
- URL: http://arxiv.org/abs/2601.07092v1
- Date: Sun, 11 Jan 2026 23:25:49 GMT
- Title: Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression
- Authors: Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu,
- Abstract summary: Current state-of-the-art VQA models prioritize performance over computational efficiency. We propose an efficient VLM framework for autonomous driving VQA tasks, SRC-Pipeline. Experiments on autonomous driving video question answering tasks show that our approach achieves 66% FLOPs reduction while maintaining comparable performance.
- Score: 5.459169631906009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for low latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose an efficient VLM framework for autonomous driving VQA tasks, SRC-Pipeline. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.
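The compression idea in the abstract (few tokens for early frames, full patch tokens for recent frames) can be illustrated with a minimal sketch. This is not the authors' SRC-Pipeline: the abstract says the compression is learned, whereas the sketch below swaps in plain average pooling as a stand-in. The function names `pool_frame` and `build_visual_sequence`, the parameters `num_recent` and `tokens_per_early_frame`, and all sizes are hypothetical and chosen only for illustration.

```python
from typing import List

Token = List[float]   # one patch embedding
Frame = List[Token]   # all patch tokens of one frame


def pool_frame(frame: Frame, num_out: int) -> Frame:
    """Average contiguous groups of patch tokens down to num_out tokens.

    Assumption: average pooling stands in for a learned compressor.
    """
    if num_out <= 0 or len(frame) % num_out != 0:
        raise ValueError("num_out must be positive and divide the token count")
    group = len(frame) // num_out
    dim = len(frame[0])
    pooled = []
    for start in range(0, len(frame), group):
        chunk = frame[start:start + group]
        pooled.append([sum(tok[d] for tok in chunk) / group for d in range(dim)])
    return pooled


def build_visual_sequence(
    frames: List[Frame], num_recent: int, tokens_per_early_frame: int
) -> List[Token]:
    """Compress all but the last num_recent frames; keep recent frames dense."""
    split = max(len(frames) - num_recent, 0)
    sequence: List[Token] = []
    for frame in frames[:split]:
        sequence.extend(pool_frame(frame, tokens_per_early_frame))
    for frame in frames[split:]:
        sequence.extend(frame)
    return sequence


# Toy clip with made-up sizes: 8 frames, 16 patch tokens each, 4-dim embeddings.
clip = [[[float(f)] * 4 for _ in range(16)] for f in range(8)]
seq = build_visual_sequence(clip, num_recent=2, tokens_per_early_frame=2)
dense_len = sum(len(frame) for frame in clip)
print(len(seq), dense_len)  # 44 128
```

In this toy setting the visual sequence shrinks from 128 to 44 tokens. That is a token count only; it should not be read as reproducing the 66% FLOPs reduction reported in the abstract, which depends on the actual model and compressor.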
Related papers
- A Serverless Edge-Native Data Processing Architecture for Autonomous Driving Training [0.0]
This paper introduces the framework, an edge-native platform that enables on-vehicle data filtering and processing through user-defined functions. We evaluate the framework on an NVIDIA Jetson Orin Nano and compare it against native ROS 2 deployments. Results show competitive performance, reduced latency and jitter, and confirm that Lambda-based abstractions can support real-time data processing in embedded autonomous driving systems.
arXiv Detail & Related papers (2026-01-30T12:41:11Z)
- FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z)
- SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference [49.84148668264725]
We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks.
arXiv Detail & Related papers (2025-10-20T17:35:47Z)
- FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning [75.80110543049783]
We propose FastDriveVLA, a reconstruction-based vision token pruning framework for autonomous driving. A novel foreground adversarial-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
arXiv Detail & Related papers (2025-07-31T07:55:56Z)
- Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving [29.019907345552475]
Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. Existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. This paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations.
arXiv Detail & Related papers (2025-06-05T12:59:35Z)
- FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z)
- Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities. We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z)
- LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement [4.534832757549232]
We introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis.
arXiv Detail & Related papers (2024-11-20T02:14:07Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.