Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware
- URL: http://arxiv.org/abs/2507.01563v1
- Date: Wed, 02 Jul 2025 10:27:41 GMT
- Title: Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware
- Authors: Marco Giordano, Stefano Giacomelli, Claudia Rinaldi, Fabio Graziosi
- Abstract summary: We present a full-stack emergency vehicle siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities.
- Score: 0.26249027950824516
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a full-stack emergency vehicle (EV) siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs, and optimized for binary sound event detection under urban acoustic conditions. A key contribution is the creation of curated and semantically structured datasets - AudioSet-EV, AudioSet-EV Augmented, and Unified-EV - developed using a custom AudioSet-Tools framework to overcome the low reliability of standard AudioSet annotations. The system is deployed on a Raspberry Pi 5 equipped with a high-fidelity DAC+microphone board, implementing a multithreaded inference engine with adaptive frame sizing, probability smoothing, and a decision-state machine to control false positive activations. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities. Performance is evaluated using both framewise and event-based metrics across multiple configurations. Results show the system achieves low-latency detection with improved robustness under realistic audio conditions. This work demonstrates the feasibility of deploying IoS-compatible SED solutions that can form distributed acoustic monitoring networks, enabling collaborative emergency vehicle tracking across smart city infrastructures through WebSocket connectivity on low-cost edge devices.
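The abstract describes an inference pipeline that combines probability smoothing with a decision-state machine to control false positive activations. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation: it assumes an exponential moving average for smoothing and hysteresis thresholds with consecutive-frame counts, and all numeric values (alpha, thresholds, frame counts) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SirenDecisionMachine:
    """Hysteresis decision-state machine over smoothed per-frame siren
    probabilities. All parameter values are illustrative assumptions,
    not the values used in the paper."""
    alpha: float = 0.3           # EMA smoothing factor
    on_threshold: float = 0.7    # smoothed prob needed to arm a detection
    off_threshold: float = 0.3   # smoothed prob needed to release it
    frames_to_trigger: int = 3   # consecutive frames above on_threshold
    frames_to_release: int = 5   # consecutive frames below off_threshold

    def __post_init__(self) -> None:
        self.smoothed = 0.0
        self.active = False
        self._above = 0
        self._below = 0

    def update(self, frame_prob: float) -> bool:
        """Consume one frame-level probability; return current detection state."""
        # Exponential moving average suppresses single-frame spikes.
        self.smoothed = self.alpha * frame_prob + (1 - self.alpha) * self.smoothed
        if not self.active:
            self._above = self._above + 1 if self.smoothed >= self.on_threshold else 0
            if self._above >= self.frames_to_trigger:
                self.active, self._above = True, 0
        else:
            self._below = self._below + 1 if self.smoothed <= self.off_threshold else 0
            if self._below >= self.frames_to_release:
                self.active, self._below = False, 0
        return self.active
```

The two-threshold hysteresis plus consecutive-frame counting is one standard way to keep a detector from chattering on borderline frames; the paper's own state machine may differ in structure and parameters.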
Related papers
- From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection [0.26249027950824516]
This work introduces E2PANNs (Efficient Emergency Pre-trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture for binary EV siren detection. We fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. Results demonstrate that E2PANNs establish a new state of the art in this research domain, with high computational efficiency and suitability for edge-based audio monitoring and safety-critical applications.
arXiv Detail & Related papers (2025-06-30T00:21:07Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms [0.5850093728139567]
Deepfake audio poses a rising threat in communication platforms, necessitating real-time detection for audio stream integrity.
This study assesses the viability of employing static deepfake audio detection models in real-time communication platforms.
Two deepfake audio detection models based on Resnet and LCNN architectures are implemented.
arXiv Detail & Related papers (2024-03-18T13:35:10Z) - Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonized and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Audio Tagging on an Embedded Hardware Platform [20.028643659869573]
We analyze how the performance of large-scale pretrained audio neural networks changes when deployed on hardware such as the Raspberry Pi.
Our experiments reveal that the continuous CPU usage results in an increased temperature that can trigger an automated slowdown mechanism.
Microphone quality (for example, with affordable devices like the Google AIY Voice Kit) and audio signal volume both affect system performance.
arXiv Detail & Related papers (2023-06-15T13:02:41Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate on designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.