Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
- URL: http://arxiv.org/abs/2504.16102v1
- Date: Tue, 15 Apr 2025 21:10:17 GMT
- Title: Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
- Authors: Xiwen Li, Ross Whitaker, Tolga Tasdizen
- Abstract summary: Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. We propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads.
- Score: 1.2699007098398802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues -- video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.
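To make the audio-visual fusion idea concrete, the snippet below is a minimal PyTorch sketch of patch-level cross-modal attention in which visual patch tokens query audio tokens. The module name, dimensions, token counts, and single-block structure are illustrative assumptions for exposition only, not the actual AVIVDNetv2 implementation described in the paper.

```python
# Minimal sketch of patch-level cross-modal (audio -> visual) attention,
# illustrating the general idea behind a cross-modal transformer for IVD.
# All shapes, sizes, and names are assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One transformer block in which visual patch tokens attend to audio tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)   # normalizes the visual (query) stream
        self.norm_a = nn.LayerNorm(dim)   # normalizes the audio (key/value) stream
        self.norm_out = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_patches, dim) from a video backbone
        # audio_tokens:  (B, N_audio,  dim) from an audio (spectrogram) encoder
        q = self.norm_v(visual_tokens)
        kv = self.norm_a(audio_tokens)
        fused, _ = self.cross_attn(q, kv, kv)      # every patch attends to all audio tokens
        x = visual_tokens + fused                   # residual connection
        return x + self.ffn(self.norm_out(x))       # position-wise feed-forward


if __name__ == "__main__":
    block = CrossModalBlock()
    video = torch.randn(2, 196, 256)   # e.g. 14x14 visual patches
    audio = torch.randn(2, 32, 256)    # e.g. 32 audio frames
    print(block(video, audio).shape)   # torch.Size([2, 196, 256])
```

In a full detector, the fused patch tokens would then feed a multiscale visual fusion stage and per-task (localization and state classification) heads; those components are only named in the abstract and are not sketched here.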
Related papers
- Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.
The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
arXiv Detail & Related papers (2025-03-06T07:36:06Z) - Joint Audio-Visual Idling Vehicle Detection with Streamlined Input Dependencies [2.8517252798391177]
Idling vehicle detection can be helpful in monitoring and reducing unnecessary idling.
We introduce an end-to-end joint audio-visual IVD task.
Unlike feature co-occurrence tasks such as audio-visual vehicle tracking, our IVD task addresses complementary features.
arXiv Detail & Related papers (2024-10-28T16:13:44Z) - EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [64.58258341591929]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
We establish the first set of large-scale AR-MOT benchmarks.
arXiv Detail & Related papers (2024-02-28T12:50:16Z) - Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes [70.08318779492944]
We are the first to harness vanishing point (VP) priors for more effective segmentation.
Our novel, efficient network for VSS, named VPSeg, incorporates two modules that utilize exactly this pair of static and dynamic VP priors.
arXiv Detail & Related papers (2024-01-27T01:01:58Z) - Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos [22.16190711818432]
We introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection.
Unlike previous approaches, the supervised signal of our method is derived from languages rather than one-hot vectors, providing a more comprehensive representation.
It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset.
arXiv Detail & Related papers (2024-01-07T15:47:19Z) - Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer [12.398902878803034]
This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze driving behavior.
The proposed GAF-ViT model consists of three key components: a Transformer Module, a Channel Attention Module, and a Multi-Channel ViT Module.
arXiv Detail & Related papers (2023-10-21T04:24:30Z) - M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z) - CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z) - V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.