Improved Training for End-to-End Streaming Automatic Speech Recognition
Model with Punctuation
- URL: http://arxiv.org/abs/2306.01296v1
- Date: Fri, 2 Jun 2023 06:46:14 GMT
- Title: Improved Training for End-to-End Streaming Automatic Speech Recognition
Model with Punctuation
- Authors: Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek
- Abstract summary: We propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss.
By combining CTC losses on the chunks and utterances, we improved both the F1 score of punctuation prediction and the Word Error Rate (WER).
- Score: 0.08602553195689511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Punctuated text prediction is crucial for automatic speech recognition as it
enhances readability and impacts downstream natural language processing tasks.
In streaming scenarios, the ability to predict punctuation in real-time is
particularly desirable but presents a difficult technical challenge. In this
work, we propose a method for predicting punctuated text from input speech
using a chunk-based Transformer encoder trained with Connectionist Temporal
Classification (CTC) loss. The acoustic model trained with long sequences by
concatenating the input and target sequences can learn punctuation marks
attached to the end of sentences more effectively. Additionally, by combining
CTC losses on the chunks and utterances, we improved both the F1 score of
punctuation prediction and the Word Error Rate (WER).
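The abstract describes combining CTC losses computed on chunks and on whole utterances, but does not give the exact formulation. As an illustration only, the sketch below implements the standard CTC forward algorithm in pure Python and a hypothetical weighted combination of utterance-level and per-chunk CTC losses; the function names, the `weight` parameter, and the chunk-boundary format are assumptions, not the paper's actual method.

```python
import math

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log P(target | log_probs).
    log_probs: T frames, each a list of per-label log-probabilities.
    target: label sequence without blanks."""
    ext = [blank]
    for s in target:
        ext += [s, blank]          # interleave blanks: a,b -> _,a,_,b,_
    S = len(ext)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s]: log-prob of all alignments ending at ext[s] in current frame
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay
            if s > 0:
                terms.append(alpha[s - 1])          # advance by one
            # skip a blank, unless that would merge a repeated label
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    return -logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)

def combined_ctc_loss(log_probs, utt_target, chunks, weight=0.5, blank=0):
    """Hypothetical combination: weighted sum of the utterance-level CTC
    loss and the mean of per-chunk CTC losses.
    chunks: list of ((start_frame, end_frame), chunk_target) pairs."""
    utt_loss = ctc_neg_log_likelihood(log_probs, utt_target, blank)
    chunk_losses = [ctc_neg_log_likelihood(log_probs[a:b], tgt, blank)
                    for (a, b), tgt in chunks]
    return weight * utt_loss + (1 - weight) * sum(chunk_losses) / len(chunk_losses)
```

In practice a framework implementation (e.g. a batched CTC loss with GPU support) would replace the forward recursion; the point here is only how a chunk term and an utterance term can be blended into one training objective.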
Related papers
- Streaming Punctuation: A Novel Punctuation Technique Leveraging
Bidirectional Context for Continuous Speech Recognition [0.8670827427401333]
We propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2023-01-10T07:07:20Z) - Assessing Phrase Break of ESL speech with Pre-trained Language Models [6.635783609515407]
This work introduces an approach to assessing phrase breaks in ESL learners' speech with pre-trained language models (PLMs).
Unlike traditional methods, this approach converts speech to token sequences and then leverages the power of PLMs.
arXiv Detail & Related papers (2022-10-28T10:06:06Z) - Streaming Punctuation for Long-form Dictation with Transformers [0.8670827427401333]
Streaming punctuation achieves an average BLEU-score gain of 0.66 for the downstream task of Machine Translation.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2022-10-11T20:03:03Z) - End-to-end Speech-to-Punctuated-Text Recognition [23.44236710364419]
Punctuation marks are important for the readability of speech recognition results.
Conventional automatic speech recognition systems do not produce punctuation marks.
We propose an end-to-end model that takes speech as input and outputs punctuated texts.
arXiv Detail & Related papers (2022-07-07T08:58:01Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Token-Level Supervised Contrastive Learning for Punctuation Restoration [7.9713449581347104]
Punctuation is critical in understanding natural language text.
Most automatic speech recognition systems do not generate punctuation.
Recent work in punctuation restoration heavily utilizes pre-trained language models.
arXiv Detail & Related papers (2021-07-19T18:24:33Z) - Investigating the Reordering Capability in CTC-based Non-Autoregressive
End-to-End Speech Translation [62.943925893616196]
We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC).
CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability.
Our analysis shows that transformer encoders have the ability to change the word order.
arXiv Detail & Related papers (2021-05-11T07:48:45Z) - Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.