PickNet: Real-Time Channel Selection for Ad Hoc Microphone Arrays
- URL: http://arxiv.org/abs/2201.09586v1
- Date: Mon, 24 Jan 2022 10:52:43 GMT
- Title: PickNet: Real-Time Channel Selection for Ad Hoc Microphone Arrays
- Authors: Takuya Yoshioka, Xiaofei Wang, and Dongmei Wang
- Abstract summary: PickNet is a neural network model for real-time channel selection for an ad hoc microphone array consisting of multiple recording devices like cell phones.
The proposed model yielded significant gains in word error rate with limited computational cost over systems using a block-online beamformer and a single distant microphone.
- Score: 15.788867107071244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes PickNet, a neural network model for real-time channel
selection for an ad hoc microphone array consisting of multiple recording
devices like cell phones. Assuming at most one person to be vocally active at
each time point, PickNet identifies the device that is spatially closest to the
active person for each time frame by using a short spectral patch of just
hundreds of milliseconds. The model is applied to every time frame, and the
short time frame signals from the selected microphones are concatenated across
the frames to produce an output signal. As the personal devices are usually
held close to their owners, the output signal is expected to have higher
signal-to-noise and direct-to-reverberation ratios on average than the input
signals. Since PickNet utilizes only limited acoustic context at each time
frame, the system using the proposed model works in real time and is robust to
changes in acoustic conditions. Speech recognition-based evaluation was carried
out by using real conversational recordings obtained with various smartphones.
The proposed model yielded significant gains in word error rate with limited
computational cost over systems using a block-online beamformer and a single
distant microphone.
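
The abstract describes a simple frame-wise select-and-stitch pipeline: score every device per time frame from a short spectral patch, pick the best device, and concatenate the chosen frames. Below is a minimal Python sketch of that pipeline under stated assumptions: PickNet's code is not published here, so `score_channels` is a hypothetical stand-in (a crude energy proxy) for the trained neural classifier, and the frame and patch sizes are illustrative choices consistent with the "hundreds of milliseconds" of context mentioned above.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = 320        # 20 ms frames at 16 kHz (illustrative choice, not from the paper)
PATCH_FRAMES = 20      # ~400 ms of context, i.e. "hundreds of milliseconds"

def spectral_patch(channels: np.ndarray, frame_idx: int) -> np.ndarray:
    """Log-magnitude spectra of the most recent frames for every channel.

    channels: (n_mics, n_samples) synchronized device signals.
    Returns an array of shape (n_mics, n_recent_frames, FRAME_LEN // 2 + 1).
    """
    start = max(0, frame_idx + 1 - PATCH_FRAMES) * FRAME_LEN
    end = (frame_idx + 1) * FRAME_LEN
    segment = channels[:, start:end]
    n_frames = segment.shape[1] // FRAME_LEN
    frames = segment.reshape(channels.shape[0], n_frames, FRAME_LEN)
    return np.log(np.abs(np.fft.rfft(frames, axis=-1)) + 1e-8)

def score_channels(patch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained PickNet classifier.

    The real model is a neural network trained to identify the device
    spatially closest to the active speaker; this proxy merely favors the
    channel with the highest mean log-magnitude, a crude SNR surrogate.
    """
    return patch.mean(axis=(1, 2))

def select_and_stitch(channels: np.ndarray) -> np.ndarray:
    """Pick one device per time frame and concatenate the chosen frames."""
    n_frames = channels.shape[1] // FRAME_LEN
    out = np.empty(n_frames * FRAME_LEN, dtype=channels.dtype)
    for t in range(n_frames):
        best = int(np.argmax(score_channels(spectral_patch(channels, t))))
        out[t * FRAME_LEN:(t + 1) * FRAME_LEN] = \
            channels[best, t * FRAME_LEN:(t + 1) * FRAME_LEN]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mics = rng.standard_normal((3, 2 * SAMPLE_RATE)).astype(np.float32)  # 3 devices, 2 s
    print(select_and_stitch(mics).shape)  # (32000,)
```

Because each decision uses only the limited acoustic context of one patch, the loop can run frame by frame on a live stream, which is what makes the approach real-time and robust to changing acoustic conditions. A deployed version would replace the proxy scorer with the trained network and would presumably smooth the signal at channel switches to avoid audible discontinuities.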
Related papers
- Binaural Angular Separation Network [7.4471290433964406]
We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones.
The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.
arXiv Detail & Related papers (2024-01-16T22:36:12Z)
- Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is more beneficial than local convolution.
We obtain a 2.4% relative word error rate reduction on a production-scale model by replacing its convolutional neural network frontend.
arXiv Detail & Related papers (2023-06-12T08:37:36Z)
- Universal speaker recognition encoders for different speech segments duration [7.104489204959814]
A system trained simultaneously on pooled short and long speech segments does not yield optimal verification results.
We describe our simple recipe for training a universal speaker encoder for any selected neural network architecture.
arXiv Detail & Related papers (2022-10-28T16:06:00Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency [3.119625275101153]
The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation.
The full end-to-end system can generate speech of almost natural quality, as verified by listening tests.
arXiv Detail & Related papers (2021-11-17T11:46:43Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, surpassing the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Neural Speech Separation Using Spatially Distributed Microphones [19.242927805448154]
This paper proposes a neural network based speech separation method using spatially distributed microphones.
Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance.
Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
arXiv Detail & Related papers (2020-04-28T17:16:31Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.