VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device
Speech Recognition
- URL: http://arxiv.org/abs/2009.04323v1
- Date: Wed, 9 Sep 2020 14:26:56 GMT
- Title: VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device
Speech Recognition
- Authors: Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan
Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika,
Alexander Gruenstein
- Abstract summary: We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
- Score: 60.462770498366524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce VoiceFilter-Lite, a single-channel source separation model that
runs on the device to preserve only the speech signals from a target user, as
part of a streaming speech recognition system. Delivering such a model presents
numerous challenges: it should improve recognition when the input signal
consists of overlapped speech, yet must not degrade speech recognition
performance under any other acoustic condition. In addition, the model must be
tiny, fast, and perform inference in a streaming fashion in order to have
minimal impact on CPU, memory, battery, and latency. We propose novel techniques
to meet these multi-faceted requirements, including using a new asymmetric
loss, and adopting adaptive runtime suppression strength. We also show that
such a model can be quantized as an 8-bit integer model and run in real time.
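The two techniques named in the abstract lend themselves to a short illustration. The sketch below is a rough Python reading of the ideas, not the paper's implementation: an asymmetric L2-style loss that penalizes over-suppression more heavily via a factor alpha, and a runtime blend whose weight w controls suppression strength. The function names, alpha, and w are assumptions introduced here for illustration only.

```python
# Illustrative sketch only: a plausible asymmetric spectral loss and an
# adaptive output blend. `alpha`, `w`, and the function names are
# assumptions for illustration, not the paper's exact formulation.
import numpy as np

def asymmetric_l2_loss(clean: np.ndarray, enhanced: np.ndarray, alpha: float = 2.0) -> float:
    """Weight over-suppression errors (enhanced < clean) more heavily,
    discouraging the model from removing speech it should keep."""
    diff = clean - enhanced                        # positive where the model suppressed too much
    weighted = np.where(diff > 0, alpha * diff, diff)
    return float(np.mean(weighted ** 2))

def blend_output(noisy: np.ndarray, enhanced: np.ndarray, w: float) -> np.ndarray:
    """Apply suppression with adjustable strength w in [0, 1]:
    w = 0 passes the input through, w = 1 uses the separated output."""
    return w * enhanced + (1.0 - w) * noisy
```

In this reading, a larger alpha makes the training objective more conservative about suppressing signal, while w could be adapted at runtime (e.g., raised only when overlapping speech is detected) so clean, non-overlapped audio passes through largely untouched.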
Related papers
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model can save, clone, and change the timbre, gender, and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - StreamVC: Real-Time Low-Latency Voice Conversion [20.164321451712564]
StreamVC is a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech.
StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform.
arXiv Detail & Related papers (2024-01-05T22:37:26Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices.
In this work, we propose EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time.
arXiv Detail & Related papers (2023-05-23T10:28:41Z) - Guided Speech Enhancement Network [17.27704800294671]
The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering, and a single-channel speech enhancement model.
We propose a speech enhancement solution that takes both the raw microphone and beamformer outputs as inputs to an ML model.
We name the ML module in our solution as GSENet, short for Guided Speech Enhancement Network.
arXiv Detail & Related papers (2023-03-13T21:48:20Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech
Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z) - Transformer Transducer: One Model Unifying Streaming and Non-streaming
Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.