E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
- URL: http://arxiv.org/abs/2204.10749v1
- Date: Fri, 22 Apr 2022 15:13:12 GMT
- Title: E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
- Authors: W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar,
Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu
- Abstract summary: We propose an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion.
We demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
- Score: 38.79441296832869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving the performance of end-to-end ASR models on long utterances ranging
from minutes to hours in length is an ongoing challenge in speech recognition.
A common solution is to segment the audio in advance using a separate voice
activity detector (VAD) that decides segment boundary locations based purely on
acoustic speech/non-speech information. VAD segmenters, however, may be
sub-optimal for real-world speech where, e.g., a complete sentence that should
be taken as a whole may contain hesitations in the middle ("set an alarm for...
5 o'clock").
We propose to replace the VAD with an end-to-end ASR model capable of
predicting segment boundaries in a streaming fashion, allowing the segmentation
decision to be conditioned not only on better acoustic features but also on
semantic features from the decoded text, with negligible extra computation. In
experiments on real-world long-form audio (YouTube) with lengths of up to 30
experiments on real world long-form audio (YouTube) with lengths of up to 30
minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in
median end-of-segment latency compared to the VAD segmenter baseline on a
state-of-the-art Conformer RNN-T model.
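As a rough illustration of the decoding loop the abstract describes, the sketch below streams frames through a model and closes the current segment whenever the posterior of a dedicated end-of-segment token crosses a threshold; the segmentation decision can therefore see both the acoustics and the decoded prefix. The `ToyStreamingASR` stub, the token ids, and the threshold are hypothetical stand-ins, not the paper's Conformer RNN-T.

```python
import numpy as np

BLANK_ID, EOS_ID = 0, 1  # hypothetical ids for blank and end-of-segment tokens

class ToyStreamingASR:
    """Random stand-in for a streaming ASR model emitting one posterior per frame."""
    def __init__(self, vocab_size=8, seed=0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def initial_state(self):
        return None  # a real RNN-T would carry decoder state here

    def step(self, frame, state):
        logits = self.rng.normal(size=self.vocab_size)
        probs = np.exp(logits) / np.exp(logits).sum()
        return probs, state

def stream_decode(model, frames, eos_threshold=0.4):
    """Return token-id segments, closing one whenever EOS crosses the threshold."""
    segments, hyp, state = [], [], model.initial_state()
    for frame in frames:
        probs, state = model.step(frame, state)
        if probs[EOS_ID] > eos_threshold:
            # The boundary decision conditions on acoustics (the frame) and
            # on semantics (the decoded prefix carried in the state).
            if hyp:
                segments.append(hyp)
            hyp, state = [], model.initial_state()
        else:
            token = int(probs.argmax())
            if token != BLANK_ID:
                hyp.append(token)
    if hyp:
        segments.append(hyp)
    return segments

print(stream_decode(ToyStreamingASR(), frames=range(50)))
```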
Related papers
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model, which predicts the boundaries of segmental structures in speech signals, and training a phoneme prediction model, which takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
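To make the reinforcement-learned half concrete, here is a minimal REINFORCE toy for a per-frame Bernoulli boundary policy. In REBORN the reward is derived from the phoneme prediction model, which is itself retrained between rounds; the toy below replaces that with a fixed stand-in reward, so everything except the policy-gradient update is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                  # frames in a toy utterance
target = {5, 10, 15}    # stand-in for boundaries the downstream model rewards
logits = np.zeros(T)    # boundary policy: independent Bernoulli per frame

def reward(boundaries):
    # Stand-in for REBORN's feedback: the real reward comes from the
    # phoneme prediction model's output on the proposed segmentation.
    return -len(boundaries ^ target)

baseline = 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-logits))
    sample = rng.random(T) < p
    r = reward({t for t in range(T) if sample[t]})
    logits += 0.3 * (r - baseline) * (sample - p)  # REINFORCE with a baseline
    baseline = 0.9 * baseline + 0.1 * r

# Boundary probabilities should end up high only near frames 5, 10, 15.
print(np.round(1.0 / (1.0 + np.exp(-logits)), 2))
```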
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model.
We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model.
This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
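A rough sketch of that switch, with invented shapes and a one-layer stand-in for the ASR encoder: the endpointer (EP) can consume either raw audio frames or a low-level encoder latent, so at inference the cheap audio-only path can filter frames before the full encoder runs. A separate toy head per path stands in for the paper's EP; none of the sizes or names below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM = 80, 256

W_enc = rng.normal(scale=0.01, size=(FRAME_DIM, LATENT_DIM))  # toy ASR encoder layer
W_ep_audio = rng.normal(scale=0.01, size=(FRAME_DIM, 1))      # EP path on raw audio
W_ep_latent = rng.normal(scale=0.01, size=(LATENT_DIM, 1))    # EP path on ASR latents

def endpoint_prob(frames, use_latents):
    """Per-frame probability that speech has ended, via either switch path."""
    if use_latents:
        x = np.tanh(frames @ W_enc)   # low-level ASR representation
        logits = x @ W_ep_latent
    else:
        logits = frames @ W_ep_audio  # EP runs directly on audio frames
    return 1.0 / (1.0 + np.exp(-logits))

frames = rng.normal(size=(100, FRAME_DIM))
print(endpoint_prob(frames, use_latents=False).shape)  # (100, 1)
print(endpoint_prob(frames, use_latents=True).shape)   # (100, 1)
```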
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
- Universal speaker recognition encoders for different speech segments duration [7.104489204959814]
A system trained simultaneously on pooled short and long speech segments does not give optimal verification results.
We describe our simple recipe for training a universal speaker encoder for any selected neural network architecture.
arXiv Detail & Related papers (2022-10-28T16:06:00Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve the segmentation F0.5 score by 9.8% over the baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
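As a toy of why linguistic evidence and look-ahead help, the hand-weighted score below combines a pause-length feature with a language model's end-of-sentence probability and a look-ahead term. The feature names and weights are invented; the paper learns this combination rather than hand-tuning it.

```python
def boundary_score(pause_ms, lm_eos_prob, lookahead_continue_prob,
                   w_acoustic=0.4, w_lm=0.4, w_lookahead=0.2):
    """Score a candidate segment boundary from acoustic and linguistic cues."""
    acoustic = min(pause_ms / 500.0, 1.0)          # long pause -> likely boundary
    lookahead = 1.0 - lookahead_continue_prob      # upcoming words don't continue
    return w_acoustic * acoustic + w_lm * lm_eos_prob + w_lookahead * lookahead

# A hesitation ("set an alarm for... 5 o'clock"): long pause, but the LM and
# look-ahead both say the sentence continues -> low score.
print(boundary_score(pause_ms=600, lm_eos_prob=0.1, lookahead_continue_prob=0.9))
# A true sentence end: modest pause, LM confident it is complete -> high score.
print(boundary_score(pause_ms=300, lm_eos_prob=0.9, lookahead_continue_prob=0.2))
```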
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what."
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
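The serialization behind t-SOT can be sketched in a few lines: tokens from overlapping speakers are merged into one stream in emission-time order, with a special channel-change token between interleaved speakers. The toy below captures only that target format; the actual recipe works with virtual output channels, and this paper additionally estimates token-level speaker embeddings on top.

```python
CC = "<cc>"  # channel-change token

def serialize(utterances):
    """utterances: list of (emit_time, speaker, token) tuples from a mixture."""
    stream, last_speaker = [], None
    for _, speaker, token in sorted(utterances):
        if last_speaker is not None and speaker != last_speaker:
            stream.append(CC)  # mark the switch between interleaved speakers
        stream.append(token)
        last_speaker = speaker
    return stream

mix = [(0.0, "A", "hello"), (0.4, "B", "hi"), (0.8, "A", "there"),
       (1.2, "B", "how"), (1.6, "B", "are"), (2.0, "A", "friend")]
print(serialize(mix))
# ['hello', '<cc>', 'hi', '<cc>', 'there', '<cc>', 'how', 'are', '<cc>', 'friend']
```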
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
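As a generic illustration of splitting message passing by context source (not the actual iGNN blocks), the toy below gives temporal edges (the same candidate across frames) and spatial edges (candidates within a frame) their own weights before combining the messages. Every shape and the combination rule are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 16                  # 6 nodes: 3 face candidates x 2 frames
X = rng.normal(size=(N, D))   # node features

A_time = np.zeros((N, N)); A_space = np.zeros((N, N))
for i in range(3):            # temporal edges: candidate i in frame 0 <-> frame 1
    A_time[i, i + 3] = A_time[i + 3, i] = 1
for f in (0, 3):              # spatial edges: candidate pairs within one frame
    for i in range(3):
        for j in range(3):
            if i != j:
                A_space[f + i, f + j] = 1

W_time = rng.normal(scale=0.1, size=(D, D))
W_space = rng.normal(scale=0.1, size=(D, D))

def split_message_passing(X):
    msg_t = A_time @ X @ W_time    # messages along temporal context
    msg_s = A_space @ X @ W_space  # messages along spatial context
    return np.tanh(X + msg_t + msg_s)

print(split_message_passing(X).shape)  # (6, 16)
```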
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
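The data side of "adding the previous segment as context" can be sketched as simple input construction: prepend the previous segment's tokens, separated by a marker, to the current segment before translation. The separator symbol and function name are invented, and how the model consumes the context is a separate design choice not shown here.

```python
SEP = "<sep>"  # invented separator token between context and current segment

def with_context(prev_segment, cur_segment):
    """Prepend the previous segment's tokens as context for the current one."""
    if not prev_segment:
        return list(cur_segment)
    return list(prev_segment) + [SEP] + list(cur_segment)

segments = [["set", "an", "alarm", "for"], ["5", "o'clock"]]
prev = []
for seg in segments:
    print(with_context(prev, seg))
    prev = seg
# ['set', 'an', 'alarm', 'for']
# ['set', 'an', 'alarm', 'for', '<sep>', '5', "o'clock"]
```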
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures.
We use consecutive blank labels in the greedy CTC output as a cue for detecting speech segments with simple thresholding.
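Given a greedy CTC label sequence, the thresholding itself is a few lines: a run of consecutive blank labels at least as long as a threshold is treated as non-speech and closes the current segment. Below is a minimal sketch; the blank id, the threshold, and the toy label sequence are illustrative.

```python
BLANK = 0  # illustrative blank label id

def ctc_vad_segments(greedy_labels, min_blank_run=30):
    """Return (start, end) frame spans of detected speech segments."""
    segments, start, blanks = [], None, 0
    for t, lab in enumerate(greedy_labels):
        if lab == BLANK:
            blanks += 1
            # A long enough blank run is treated as non-speech: end the segment.
            if start is not None and blanks >= min_blank_run:
                segments.append((start, t - blanks + 1))
                start = None
        else:
            if start is None:
                start = t  # first non-blank label opens a segment
            blanks = 0
    if start is not None:
        segments.append((start, len(greedy_labels)))
    return segments

labels = [0]*10 + [5, 5, 0, 7]*5 + [0]*40 + [3, 0, 0, 9] + [0]*35
print(ctc_vad_segments(labels))  # [(10, 30), (70, 74)]
```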
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.