Related papers: Multi-Channel Replay Speech Detection using Acoustic Maps

Multi-Channel Replay Speech Detection using Acoustic Maps

URL: http://arxiv.org/abs/2602.16399v1
Date: Wed, 18 Feb 2026 12:18:45 GMT
Title: Multi-Channel Replay Speech Detection using Acoustic Maps
Authors: Michael Neri, Tuomas Virtanen,
Abstract summary: We propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings.<n>A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset.
Score: 8.466109515054315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

Related papers

Acoustic Simulation Framework for Multi-channel Replay Speech Detection [8.466109515054315]
Replay speech attacks pose a significant threat to voice-controlled systems.<n>We introduce an acoustic simulation framework designed to simulate multi-channel replay speech configurations.
arXiv Detail & Related papers (2025-09-18T09:38:58Z)
AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene. Driven by the unique properties of RIR, we design a temporal correlation module and multi-scale energy decay criterion. Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z)
Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction [18.641610823584433]
We introduce an unsupervised data-driven approach that exploits the natural structure of the data. Our method builds upon local conformal autoencoders (LOCA) - an offline deep learning scheme for learning standardized data coordinates from measurements.
arXiv Detail & Related papers (2023-01-01T17:46:09Z)
End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder. We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables. Experiments achieved a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments of transients and stationary noises. We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure. A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Esting the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments. We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals. Two main challenges are the complex acoustic environment and the real-time processing requirement. We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.