Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR
- URL: http://arxiv.org/abs/2305.18419v1
- Date: Sun, 28 May 2023 19:31:45 GMT
- Title: Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR
- Authors: W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath
- Abstract summary: We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance.
This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence.
- Score: 35.750921748001275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method of segmenting long-form speech by separating semantically
complete sentences within the utterance. This prevents the ASR decoder from
needlessly processing faraway context while also preventing it from missing
relevant context within the current sentence. Semantically complete sentence
boundaries are typically demarcated by punctuation in written text; but
unfortunately, spoken real-world utterances rarely contain punctuation. We
address this limitation by distilling punctuation knowledge from a
bidirectional teacher language model (LM) trained on written, punctuated text.
We compare our segmenter, which is distilled from the LM teacher, against a
segmenter distilled from an acoustic-pause-based teacher used in other works, on
a streaming ASR pipeline. The pipeline with our segmenter achieves a 3.2%
relative WER gain along with a 60 ms median end-of-segment latency reduction on
a YouTube captioning task.
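As a reading aid for the "3.2% relative WER gain" above: a relative gain is usually computed as the WER reduction divided by the baseline WER. The sketch below illustrates that arithmetic only. The function name and the two WER values are hypothetical placeholders chosen for illustration; the abstract does not report absolute WER figures.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent (NOT taken from the paper).
baseline = 12.5  # assumed pipeline with the acoustic-pause-based segmenter
improved = 12.1  # assumed pipeline with the LM-distilled segmenter

print(f"{relative_wer_reduction(baseline, improved):.1%}")  # prints 3.2%
```

Under these assumed numbers, an absolute drop of 0.4 WER points corresponds to a 3.2% relative gain, so a relative figure on its own does not reveal the absolute error rates.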
Related papers
- SSR: Alignment-Aware Modality Connector for Speech Language Models [23.859649312290447]
Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality.
We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion.
arXiv Detail & Related papers (2024-09-30T19:17:46Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between two steps: training a segmentation model that predicts the boundaries of segmental structures in speech signals, and training a phoneme prediction model that takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- Boosting Weakly-Supervised Temporal Action Localization with Text Information [94.48602948837664]
We propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments.
We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantic-related segments from videos to complete the text sentence.
Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin.
arXiv Detail & Related papers (2023-05-01T00:07:09Z)
- Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences [67.37544997614646]
We present the first study on unsupervised spoken constituency parsing.
The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees.
We show that accurate segmentation alone may be sufficient to parse spoken sentences accurately.
arXiv Detail & Related papers (2023-03-15T17:57:22Z)
- Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition [0.8670827427401333]
We propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2023-01-10T07:07:20Z)
- Streaming Punctuation for Long-form Dictation with Transformers [0.8670827427401333]
Streaming punctuation achieves an average BLEU-score gain of 0.66 for the downstream task of Machine Translation.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2022-10-11T20:03:03Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z)