Do You Listen with One or Two Microphones? A Unified ASR Model for
Single and Multi-Channel Audio
- URL: http://arxiv.org/abs/2106.02750v1
- Date: Fri, 4 Jun 2021 22:58:42 GMT
- Authors: Gokce Keskin, Minhua Wu, Brian King, Harish Mallidi, Yang Gao, Jasha
Droppo, Ariya Rastrow, Roland Maas
- Abstract summary: We propose a unified ASR model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes.
We demonstrate its efficacy in a realistic scenario where a set of devices typically stream a single primary audio channel, and two additional auxiliary channels only when upload bandwidth allows it.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) models are typically designed to operate
on a single input data type, e.g. a single- or multi-channel audio stream from
a device. This design decision assumes the primary input data source
does not change and that an additional (auxiliary) data source, even when
occasionally available, cannot be used. An ASR model that operates on both
primary and auxiliary data can achieve better accuracy than a
primary-only solution, and a model that can serve both primary-only
(PO) and primary-plus-auxiliary (PPA) modes is highly desirable. In
this work, we propose a unified ASR model that can serve both modes. We
demonstrate its efficacy in a realistic scenario where a set of devices
typically stream a single primary audio channel, and two additional auxiliary
channels only when upload bandwidth allows it. The architecture
enables a unique methodology that uses both types of input audio during
training time. Our proposed approach achieves up to 12.5% relative
word-error-rate reduction (WERR) compared to a PO baseline, and up to 16.0%
relative WERR in low-SNR conditions. The unique training methodology achieves
up to 2.5% relative WERR compared to a PPA baseline.
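The PO/PPA idea can be sketched in a few lines of Python. This is a hedged illustration, not the paper's actual architecture: the function name `build_model_input`, the zero-padding of missing channels, and the presence mask are all assumptions made here so that one model input layout can serve either mode.

```python
def build_model_input(primary, auxiliaries=None, num_aux=2):
    """Pack 1 primary + up to `num_aux` auxiliary channels into a fixed layout.

    primary      -- list of float samples from the primary microphone
    auxiliaries  -- optional list of auxiliary channels (each a list of samples)
    Returns (channels, mask): `channels` always has 1 + num_aux entries, and
    mask[i] is 1.0 when channel i carries real audio, 0.0 when it is padding,
    so a single model can consume both PO and PPA inputs.
    """
    channels = [primary]
    mask = [1.0]
    aux = auxiliaries or []
    for i in range(num_aux):
        if i < len(aux):
            channels.append(aux[i])
            mask.append(1.0)
        else:
            channels.append([0.0] * len(primary))  # zero-pad missing channel
            mask.append(0.0)
    return channels, mask

# PO mode: only the primary stream is uploaded.
po_channels, po_mask = build_model_input([0.1, -0.2, 0.3])
# PPA mode: two auxiliary channels fit within the upload bandwidth.
ppa_channels, ppa_mask = build_model_input([0.1, -0.2, 0.3],
                                           [[0.0, 0.1, 0.0], [0.2, 0.0, 0.1]])
```

Because the layout is fixed, PO and PPA utterances can be mixed freely in one training batch, which is the property the unified training methodology relies on.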
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The network addresses the limitations of SIMO models by aggregating cross-speaker representations.
It is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z) - Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
arXiv Detail & Related papers (2023-08-22T12:09:30Z) - OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality
Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one to any in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z) - Adapting an Unadaptable ASR System [40.402050390096456]
We consider the recently released OpenAI Whisper ASR as an example of a large-scale ASR system to assess adaptation methods.
An error correction based approach is adopted, as this does not require access to the model.
The generalization ability of the system is then evaluated along two distinct dimensions.
arXiv Detail & Related papers (2023-06-01T23:54:11Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
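The core t-SOT idea, serializing overlapping speakers into one token stream, can be sketched as follows. This is a simplified illustration based on the summary above: the function `serialize`, the `(time, token)` tuple format, and the exact placement of the channel-change token are assumptions, not the paper's precise recipe.

```python
CC = "<cc>"  # channel-change token marking a switch between virtual channels

def serialize(token_streams):
    """Merge per-speaker (time, token) streams into one t-SOT-style stream.

    token_streams -- list of per-channel lists of (emission_time, token)
    Tokens are ordered by emission time; a <cc> token is inserted whenever
    consecutive tokens come from different virtual channels.
    """
    tagged = [(t, ch, tok)
              for ch, stream in enumerate(token_streams)
              for (t, tok) in stream]
    tagged.sort(key=lambda x: x[0])  # global emission-time order
    out, prev_ch = [], None
    for _, ch, tok in tagged:
        if prev_ch is not None and ch != prev_ch:
            out.append(CC)
        out.append(tok)
        prev_ch = ch
    return out

# Two overlapping speakers interleave by emission time:
stream = serialize([[(0.0, "hi"), (0.8, "there")],
                    [(0.5, "good"), (1.2, "morning")]])
# → ["hi", "<cc>", "good", "<cc>", "there", "<cc>", "morning"]
```

For non-overlapping speech the channels never interleave, so no `<cc>` tokens appear and the target collapses to an ordinary single-talker transcript, matching the parity claim above.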
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Zero-shot Audio Source Separation through Query-based Learning from
Weakly-labeled Data [26.058278155958668]
We propose a three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset, AudioSet.
Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training.
The proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training.
arXiv Detail & Related papers (2021-12-15T05:13:43Z) - Cross-Modal Transformer-Based Neural Correction Models for Automatic
Speech Recognition [31.2558640840697]
We propose a cross-modal transformer-based neural correction model that refines the output of an automatic speech recognition system.
Experiments on Japanese natural language ASR tasks demonstrated that our proposed models achieve better ASR performance than conventional neural correction models.
arXiv Detail & Related papers (2021-07-04T07:58:31Z) - Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition
with Source Localization [73.62550438861942]
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
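One concrete way an azimuth angle can steer an array toward a source is via classic delay-and-sum steering delays. The sketch below is a hedged illustration of that geometric relationship, not D-ASR's actual neural formulation; `steering_delays` and the 2-D plane-wave model are assumptions made here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def steering_delays(mic_positions, azimuth_deg):
    """Per-microphone time delays for a plane wave arriving from azimuth_deg.

    mic_positions -- list of (x, y) microphone coordinates in metres
    Returns delays in seconds relative to the array origin; delaying each
    channel by its value and summing steers the array toward that azimuth,
    which is why the angle controls separation quality.
    """
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector toward the source
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for (x, y) in mic_positions]

# Two mics 10 cm apart on the x-axis, source broadside (90 degrees):
broadside = steering_delays([(0.0, 0.0), (0.1, 0.0)], 90.0)
# Same array, source endfire (0 degrees): the second mic leads by ~0.29 ms.
endfire = steering_delays([(0.0, 0.0), (0.1, 0.0)], 0.0)
```

A wrong latent azimuth produces mismatched delays and a blurrier separated signal, which is the mechanism behind the separation-quality claim above.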
arXiv Detail & Related papers (2020-10-30T20:26:28Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
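The two-pass rescoring pattern in the entry above can be sketched as a score interpolation. This is a deliberately simplified assumption-laden illustration: the deliberation model also attends to acoustics and first-pass hypotheses through a bidirectional encoder, which this text-only `rescore` helper abstracts away.

```python
def rescore(hypotheses, second_pass_score, weight=0.5):
    """Pick the best hypothesis after second-pass rescoring.

    hypotheses        -- list of (text, first_pass_log_prob) from the
                         streaming first pass
    second_pass_score -- function text -> log-probability from the
                         non-streaming second pass
    The combined score linearly interpolates the two passes.
    """
    best = max(hypotheses,
               key=lambda h: (1 - weight) * h[1]
                             + weight * second_pass_score(h[0]))
    return best[0]

# Toy second pass that strongly prefers hypotheses containing "world":
hyps = [("hello word", -2.0), ("hello world", -2.5)]
choice = rescore(hyps, lambda text: 0.0 if "world" in text else -3.0)
# choice == "hello world": the second pass flips the first-pass ranking
```

The point of the example is the flip: the first pass ranked "hello word" higher, but the richer second-pass score corrects it, which is exactly what rescoring streamed hypotheses buys.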
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.