Acoustic Simulation Framework for Multi-channel Replay Speech Detection
- URL: http://arxiv.org/abs/2509.14789v1
- Date: Thu, 18 Sep 2025 09:38:58 GMT
- Title: Acoustic Simulation Framework for Multi-channel Replay Speech Detection
- Authors: Michael Neri, Tuomas Virtanen
- Abstract summary: Replay speech attacks pose a significant threat to voice-controlled systems. We introduce an acoustic simulation framework for generating multi-channel replay speech configurations.
- Score: 8.466109515054315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can enhance replay detection robustness, existing datasets and methods predominantly rely on single-channel recordings. In this work, we introduce an acoustic simulation framework for generating multi-channel replay speech configurations from publicly available resources. Our setup models both genuine and spoofed speech across varied environments, including realistic microphone and loudspeaker impulse responses, room acoustics, and noise conditions. The framework employs measured loudspeaker directionalities during the replay attack to improve the realism of the simulation. We define two spoofing settings, which simulate whether reverberant or anechoic speech is used in the replay scenario, and evaluate the impact of omnidirectional and diffuse noise on detection performance. Using the state-of-the-art M-ALRAD model for replay speech detection, we demonstrate that synthetic data can support the generalization capabilities of the detector across unseen enclosures.
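The processing chain the abstract describes (dry source → loudspeaker IR → per-microphone room and microphone IRs → additive noise at a target SNR) can be sketched as follows. This is a minimal illustration with placeholder impulse responses and a simple SNR scaling, not the authors' released code:

```python
# Hedged sketch of the simulation chain; IRs, noise, and SNR handling are
# placeholders, not the paper's actual pipeline.
import numpy as np
from scipy.signal import fftconvolve

def simulate_replay(speech, spk_ir, rirs, mic_irs, noise, snr_db):
    """Simulate one multi-channel spoofed (replayed) recording.

    speech  : (T,)  dry source; the reverberant spoofing setting would
              first convolve this with a playback-room RIR
    spk_ir  : (Ls,) measured loudspeaker impulse response
    rirs    : (M, Lr) room impulse responses, one per microphone
    mic_irs : (M, Lm) microphone impulse responses
    noise   : (M, >=T) omnidirectional or diffuse noise
    """
    played = fftconvolve(speech, spk_ir)          # loudspeaker colouration
    chans = [fftconvolve(fftconvolve(played, rirs[m]), mic_irs[m])[: len(speech)]
             for m in range(rirs.shape[0])]
    x = np.stack(chans)                           # (M, T)
    n = noise[:, : x.shape[1]]
    # Scale the noise so the mixture matches the requested SNR.
    gain = np.sqrt((x ** 2).mean() / (10 ** (snr_db / 10) * (n ** 2).mean() + 1e-12))
    return x + gain * n
```

A genuine recording follows the same chain without the loudspeaker stage.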
Related papers
- Visual-Informed Speech Enhancement Using Attention-Based Beamforming [13.084978776817222]
We propose a novel Visual-Informed Neural Beamforming Network (VI-NBFNet).
The proposed network integrates microphone array signal processing and deep neural networks (DNNs) using multimodal input features.
It is intended to handle both static and moving speakers by introducing a supervised end-to-end beamforming framework equipped with an attention mechanism.
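As a rough, assumed sketch of attention-based filter-and-sum beamforming (the actual VI-NBFNet architecture is not given here, and its visual input branch is omitted): self-attention across microphone embeddings predicts a complex weight per channel and frequency, which is then applied as a filter-and-sum:

```python
# Assumed sketch only; shapes and layers are illustrative, not VI-NBFNet's.
import torch
import torch.nn as nn

class AttnBeamformer(nn.Module):
    def __init__(self, n_freq, d_model=128):
        super().__init__()
        self.embed = nn.Linear(n_freq, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.to_w = nn.Linear(d_model, 2 * n_freq)      # Re and Im of w[m, f]

    def forward(self, spec):                            # (B, M, F, T) complex STFT
        B, M, F, T = spec.shape
        logmag = spec.abs().clamp_min(1e-6).log().mean(-1)   # (B, M, F)
        h = self.embed(logmag)                               # (B, M, D)
        h, _ = self.attn(h, h, h)                            # attend across mics
        w = self.to_w(h).view(B, M, F, 2)
        w = torch.complex(w[..., 0], w[..., 1]).unsqueeze(-1)
        return (w.conj() * spec).sum(dim=1)                  # filter-and-sum (B, F, T)
```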
arXiv Detail & Related papers (2026-03-05T15:19:41Z) - Multi-Channel Replay Speech Detection using Acoustic Maps [8.466109515054315]
We propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings.
A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset.
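The summary does not define the acoustic map itself; a GCC-PHAT delay map per microphone pair is one standard spatial representation of this kind, sketched below as an assumed stand-in for the paper's feature:

```python
# Assumed stand-in: GCC-PHAT cross-correlation map over candidate delays;
# the paper's actual acoustic map formulation may differ.
import numpy as np

def gcc_phat_map(x, max_lag):
    """x: (M, T) multi-channel frame -> (n_pairs, 2*max_lag+1) delay map."""
    M, T = x.shape
    X = np.fft.rfft(x, n=2 * T, axis=1)
    maps = []
    for i in range(M):
        for j in range(i + 1, M):
            cross = X[i] * np.conj(X[j])
            cross /= np.abs(cross) + 1e-12            # PHAT weighting
            cc = np.fft.irfft(cross)
            cc = np.concatenate([cc[-max_lag:], cc[: max_lag + 1]])
            maps.append(cc)
    return np.stack(maps)                             # CNN input, one row per pair
```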
arXiv Detail & Related papers (2026-02-18T12:18:45Z) - Reciprocal Latent Fields for Precomputed Sound Propagation [0.6474760227870046]
We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting acoustic parameters.
We show that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude.
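Purely as an assumed illustration of the idea suggested by the name (the paper's actual parameterization is not given here), a latent field can map source/receiver positions to acoustic parameters while enforcing acoustic reciprocity by symmetrizing its inputs:

```python
# Assumed illustration only; RLF's real architecture may differ entirely.
import torch
import torch.nn as nn

class ReciprocalField(nn.Module):
    def __init__(self, n_params=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, src, rcv):                      # (B, 3) positions each
        # Reciprocity: swapping source and receiver must not change the
        # prediction, so average the two input orderings.
        a = self.net(torch.cat([src, rcv], dim=-1))
        b = self.net(torch.cat([rcv, src], dim=-1))
        return 0.5 * (a + b)                          # e.g. RT60, DRR, ...
```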
arXiv Detail & Related papers (2026-02-06T18:31:11Z) - Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks [0.0]
Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion.
This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion.
arXiv Detail & Related papers (2025-12-31T02:06:42Z) - Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers [2.614081506519191]
We propose a novel mixture-of-experts framework for field-of-view enhancement in signal matching.
Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions.
This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality.
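A minimal, assumed sketch of the mixture-of-experts idea: a gating network weights expert outputs by the talker's angle relative to the field of view; all names and shapes below are illustrative:

```python
# Illustrative gating sketch, not the paper's architecture.
import torch
import torch.nn as nn

class FOVMixture(nn.Module):
    def __init__(self, n_experts=4, feat_dim=64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim) for _ in range(n_experts))
        self.gate = nn.Linear(feat_dim + 1, n_experts)   # +1: angle to FOV centre

    def forward(self, feats, angle):          # feats: (B, D), angle: (B, 1) rad
        g = torch.softmax(self.gate(torch.cat([feats, angle], -1)), dim=-1)
        out = torch.stack([e(feats) for e in self.experts], dim=1)  # (B, E, D)
        return (g.unsqueeze(-1) * out).sum(dim=1)        # gated expert mixture
```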
arXiv Detail & Related papers (2025-09-16T21:30:06Z) - V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow [57.51550409392103]
We introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
To address the challenges of this task, we decompose the speech signal into manageable subspaces, each representing distinct speech attributes, and predict them directly from the visual input.
To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture.
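A rectified flow matching decoder is trained with the standard flow-matching objective, sketched below; the model signature and conditioning are assumptions, not V2SFlow's actual interface:

```python
# Standard rectified-flow-matching training step; the model interface is assumed.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: (B, T, D) target speech features; cond: predicted speech attributes."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                      # straight-line interpolant
    v_target = x1 - x0                              # constant velocity along it
    v_pred = model(xt, t.squeeze(), cond)           # model predicts the velocity
    return F.mse_loss(v_pred, v_target)
```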
arXiv Detail & Related papers (2024-11-29T05:55:20Z) - End-to-end multi-channel speaker extraction and binaural speech synthesis [26.373624846079686]
Speech clarity and spatial audio immersion are two of the most critical factors in enhancing remote conferencing experiences.
We introduce an end-to-end deep learning framework that maps multi-channel noisy and reverberant signals directly to clean, spatialized speech.
In this framework, a novel magnitude-weighted interaural level difference loss function is proposed that aims to improve the accuracy of spatial rendering.
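One plausible form of a magnitude-weighted interaural level difference (ILD) loss is sketched below; the paper's exact weighting may differ:

```python
# Hedged guess at a magnitude-weighted ILD loss over complex STFTs (B, F, T).
import torch

def ild_loss(est_l, est_r, ref_l, ref_r, eps=1e-6):
    ild_est = 20 * torch.log10((est_l.abs() + eps) / (est_r.abs() + eps))
    ild_ref = 20 * torch.log10((ref_l.abs() + eps) / (ref_r.abs() + eps))
    # Weight each TF bin by reference magnitude so loud bins dominate.
    w = ref_l.abs() + ref_r.abs()
    w = w / (w.sum() + eps)
    return (w * (ild_est - ild_ref) ** 2).sum()
```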
arXiv Detail & Related papers (2024-10-08T06:55:35Z) - Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture [11.063156506583562]
We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy.
We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs.
Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
arXiv Detail & Related papers (2024-06-05T13:50:59Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
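The core spatialization step can be illustrated as follows (an assumed sketch; the paper additionally uses a spherical-harmonic decomposition to interpolate across directions) by convolving the mono source with the nearest measured HRIR pair:

```python
# Assumed sketch of HRIR-based spatialization; dataset loading is a placeholder.
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, azimuths, hrirs, az_deg):
    """mono: (T,); azimuths: (N,) HRIR grid in degrees; hrirs: (N, 2, L)."""
    # Pick the HRIR measured closest to the requested azimuth (wrapped).
    idx = np.argmin(np.abs(((azimuths - az_deg + 180) % 360) - 180))
    left = fftconvolve(mono, hrirs[idx, 0])[: len(mono)]
    right = fftconvolve(mono, hrirs[idx, 1])[: len(mono)]
    return np.stack([left, right])                   # pseudo binaural pair
```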
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
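A minimal, assumed sketch of region-wise dynamic stream weighting: audio and video localization scores are fused with a per-frame, per-region weight (estimated by a DNN in the paper, abstracted away here):

```python
# Assumed sketch; the paper's stream-weight estimator is not reproduced here.
import numpy as np

def fuse_streams(audio_ll, video_ll, weights):
    """audio_ll, video_ll: (T, R) log-likelihoods over R spatial regions;
    weights: (T, R) in [0, 1], the dynamic stream weights."""
    fused = weights * audio_ll + (1 - weights) * video_ll
    return fused.argmax(axis=1)                     # speaker region per frame
```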
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Multimodal Speech Recognition with Unstructured Audio Masking [49.01826387664443]
We simulate a more realistic masking scenario during model training, called RandWordMask.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
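A simple, assumed sketch of word-span masking in the spirit of RandWordMask, given word-boundary alignments; the paper's exact sampling policy may differ:

```python
# Assumed sketch: zero out the audio frames of randomly chosen words.
import numpy as np

def rand_word_mask(feats, word_spans, p=0.3, rng=np.random):
    """feats: (T, D) audio features; word_spans: list of (start, end) frames."""
    masked = feats.copy()
    for start, end in word_spans:
        if rng.random() < p:                        # mask this word's frames
            masked[start:end] = 0.0
    return masked
```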
arXiv Detail & Related papers (2020-10-16T21:49:20Z)