Binaural Angular Separation Network
- URL: http://arxiv.org/abs/2401.08864v1
- Date: Tue, 16 Jan 2024 22:36:12 GMT
- Title: Binaural Angular Separation Network
- Authors: Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee,
Matthias Grundmann
- Abstract summary: We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones.
The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.
- Score: 7.4471290433964406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a neural network model that can separate target speech sources
from interfering sources at different angular regions using two microphones.
The model is trained with simulated room impulse responses (RIRs) using
omni-directional microphones without needing to collect real RIRs. By relying
on specific angular regions and multiple room simulations, the model utilizes
consistent time difference of arrival (TDOA) cues, or what we call delay
contrast, to separate target and interference sources while remaining robust in
various reverberation environments. We demonstrate that the model not only
generalizes to a commercially available device with a slightly different
microphone geometry, but also outperforms our previous work, which uses one
additional microphone on the same device. The model runs in real-time on-device
and is suitable for low-latency streaming applications such as telephony and
video conferencing.
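The key cue in the abstract, delay contrast, reduces to the time difference of arrival (TDOA) between the two microphones: sources inside the target angular region produce TDOAs in a consistent range, while interferers do not. As a minimal illustration of how that cue is computed (our own sketch, not the paper's model or code; the function and signal names are hypothetical), the classic GCC-PHAT estimator recovers the inter-microphone delay and remains usable under reverberation because it discards spectral magnitude:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the delay (seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    # Phase transform: drop spectral magnitude so the correlation peak stays
    # sharp under reverberation.
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Keep lags in [-max_shift, +max_shift]; zero lag sits at index max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(interp * fs)

# Toy check: a white-noise source that reaches mic 2 eight samples later than
# mic 1, i.e. a true TDOA of 8 / 16000 s = 500 us.
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)
d = 8
mic1 = src
mic2 = np.concatenate((np.zeros(d), src[:-d]))
print(f"estimated TDOA: {gcc_phat(mic2, mic1, fs, max_tau=1e-3) * 1e6:.1f} us")
```

On this toy input the estimate matches the true 500 us delay; the paper's network learns to exploit the same cue implicitly rather than estimating TDOA explicitly.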
Related papers
- Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- PickNet: Real-Time Channel Selection for Ad Hoc Microphone Arrays [15.788867107071244]
PickNet is a neural network model for real-time channel selection for an ad hoc microphone array consisting of multiple recording devices like cell phones.
The proposed model yielded significant word error rate reductions with limited computational cost over systems using a block-online beamformer and a single distant microphone.
arXiv Detail & Related papers (2022-01-24T10:52:43Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture; a minimal sketch of the kind of direction-informed spatial features such a filter can consume follows this list.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
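As flagged in the temporal-spatial neural filter entry above, direction-informed separation front-ends typically hand the network explicit spatial features. Below is a minimal sketch of two common ones (our own illustration under assumed names, not the cited paper's implementation): the inter-channel phase difference (IPD), and an angle feature that compares the observed IPD against the phase difference expected for a chosen target direction.

```python
import numpy as np
from scipy.signal import stft

def directional_features(mic1, mic2, fs, target_tdoa, nperseg=512):
    """IPD and target-direction angle feature for a two-microphone pair.

    target_tdoa is the expected delay (seconds) of mic2 relative to mic1 for
    a source in the target direction, e.g. spacing * cos(theta) / 343.0 for
    microphones `spacing` meters apart.
    """
    f, _, X1 = stft(mic1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(mic2, fs=fs, nperseg=nperseg)
    # Observed phase difference: positive where mic2 lags mic1.
    ipd = np.angle(X1 * np.conj(X2))
    # Expected phase difference for the target direction at each frequency.
    steering = 2 * np.pi * f[:, None] * target_tdoa
    # Angle feature: near 1 in time-frequency bins dominated by a source from
    # the target direction, lower elsewhere -- a soft spatial mask cue.
    return ipd, np.cos(ipd - steering)
```

Which feature set works best is model- and geometry-dependent; the sketch only shows the mechanics.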
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.