Related papers: From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection

URL: http://arxiv.org/abs/2506.23437v1
Date: Mon, 30 Jun 2025 00:21:07 GMT
Title: From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection
Authors: Stefano Giacomelli, Marco Giordano, Claudia Rinaldi, Fabio Graziosi
Abstract summary: This work introduces E2PANNs (Efficient Emergency Pre-trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture for binary EV siren detection. We fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. Results demonstrate that E2PANNs establish a new state of the art in this research domain, with high computational efficiency and suitability for edge-based audio monitoring and safety-critical applications.
Score: 0.26249027950824516
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Accurate recognition of Emergency Vehicle (EV) sirens is critical for the integration of intelligent transportation systems, smart city monitoring systems, and autonomous driving technologies. Modern automatic solutions are limited by the lack of large-scale, curated datasets and by the computational demands of state-of-the-art sound event detection models. This work introduces E2PANNs (Efficient Emergency Pre-trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture derived from the PANNs framework, specifically optimized for binary EV siren detection. Leveraging our dedicated subset of AudioSet (AudioSet EV), we fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. The experimental campaign includes ablation studies, cross-domain benchmarking, and real-time inference deployment on an edge device. Interpretability analyses exploiting Guided Backpropagation and ScoreCAM algorithms provide insights into the model's internal representations and validate its ability to capture distinct spectrotemporal patterns associated with different types of EV sirens. Real-time performance is assessed through frame-wise and event-based detection metrics, as well as a detailed analysis of false positive activations. Results demonstrate that E2PANNs establish a new state of the art in this research domain, with high computational efficiency and suitability for edge-based audio monitoring and safety-critical applications.

Related papers

Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware [0.26249027950824516]
We present a full-stack emergency vehicle siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities.
arXiv Detail & Related papers (2025-07-02T10:27:41Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Real-Time Pedestrian Detection on IoT Edge Devices: A Lightweight Deep Learning Approach [1.4732811715354455]
This research explores implementing a lightweight deep learning model on Artificial Intelligence of Things (AIoT) edge devices. An optimized You Only Look Once (YOLO) based DL model is deployed for real-time pedestrian detection. The simulation results demonstrate that the optimized YOLO model can achieve real-time pedestrian detection, with a fast inference speed of 147 milliseconds, a frame rate of 2.3 frames per second, and an accuracy of 78%.
arXiv Detail & Related papers (2024-09-24T04:48:41Z)
A Real-Time Voice Activity Detection Based On Lightweight Neural [4.589472292598182]
Voice activity detection (VAD) is the task of detecting speech in an audio stream. Recent neural network-based VADs have alleviated the degradation of performance to some extent. We propose a lightweight and real-time neural network called MagicNet, which utilizes causal and depth separable 1-D convolutions and GRU.
arXiv Detail & Related papers (2024-05-27T03:31:16Z)
Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level. AudioSeal achieves state-of-the-art performance in terms of robustness to real life audio manipulations and imperceptibility based on automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
Real-time Aerial Detection and Reasoning on Embedded-UAVs [3.0839245814393728]
We present a unified pipeline architecture for a real-time detection system on an embedded system for UAVs. This pipeline of networks can exploit the domain-specific knowledge on aerial pedestrian detection and activity recognition.
arXiv Detail & Related papers (2023-05-21T09:43:17Z)
Detecting train driveshaft damages using accelerometer signals and Differential Convolutional Neural Networks [67.60224656603823]
This paper proposes the development of a railway axle condition monitoring system based on advanced 2D-Convolutional Neural Network (CNN) architectures. The resultant system converts the railway axle vibration signals into time-frequency domain representations, i.e., spectrograms, and, thus, trains a two-dimensional CNN to classify them depending on their cracks.
arXiv Detail & Related papers (2022-11-15T15:04:06Z)
Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features. We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method. We first use the wav2vec pre-trained model to obtain a high-level representation of the speech. For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information. We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF). The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
A Multi-view CNN-based Acoustic Classification System for Automatic Animal Species Identification [42.119250432849505]
We propose a deep learning based acoustic classification framework for Wireless Acoustic Sensor Network (WASN). The proposed framework is based on a cloud architecture which relaxes the computational burden on the wireless sensor node. To improve the recognition accuracy, we design a multi-view Convolution Neural Network (CNN) to extract the short-, middle-, and long-term dependencies in parallel.
arXiv Detail & Related papers (2020-02-23T03:51:08Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.