Streaming Punctuation for Long-form Dictation with Transformers
- URL: http://arxiv.org/abs/2210.05756v1
- Date: Tue, 11 Oct 2022 20:03:03 GMT
- Title: Streaming Punctuation for Long-form Dictation with Transformers
- Authors: Piyush Behre, Sharman Tan, Padma Varadharajan, Shuangyu Chang
- Abstract summary: Streaming punctuation achieves an average BLEU-score gain of 0.66 for the downstream task of Machine Translation.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
- Score: 0.8670827427401333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While speech recognition Word Error Rate (WER) has reached human parity for
English, long-form dictation scenarios still suffer from segmentation and
punctuation problems resulting from irregular pausing patterns or slow
speakers. Transformer sequence tagging models are effective at capturing long
bi-directional context, which is crucial for automatic punctuation. A typical
Automatic Speech Recognition (ASR) production system, however, is constrained
by real-time requirements, making it hard to incorporate the right context when
making punctuation decisions. In this paper, we propose a streaming approach
for punctuation or re-punctuation of ASR output using dynamic decoding windows
and measure its impact on punctuation and segmentation accuracy in a variety of
scenarios. The new system tackles over-segmentation issues, improving
segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average
BLEU-score gain of 0.66 for the downstream task of Machine Translation (MT).
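The core idea, re-punctuating ASR output over a dynamic decoding window so that later words can still revise earlier punctuation decisions, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the `Tagger` interface, the window size, and the finalization rule are assumptions made purely for illustration.

```python
from typing import Callable, List

# Hypothetical tagger interface: maps a window of words to per-word
# punctuation labels such as "", ",", ".", "?". Any sequence-tagging model
# (e.g. a Transformer token classifier) could stand behind it.
Tagger = Callable[[List[str]], List[str]]


class StreamingPunctuator:
    """Minimal sketch of punctuation with a dynamic decoding window.

    Incoming ASR words are buffered; the tagger re-punctuates the whole
    buffer at each step, but words are only finalized up to the last
    predicted sentence boundary, so future context can still revise the
    tail of the window.
    """

    def __init__(self, tagger: Tagger, max_window: int = 60):
        self.tagger = tagger
        self.max_window = max_window   # force a flush if no boundary appears
        self.buffer: List[str] = []

    def push(self, words: List[str]) -> List[str]:
        """Consume new ASR words, return any finalized punctuated segments."""
        self.buffer.extend(words)
        labels = self.tagger(self.buffer)

        # Find the last sentence boundary predicted inside the window.
        last_eos = max((i for i, lab in enumerate(labels) if lab in {".", "?", "!"}),
                       default=-1)
        if last_eos < 0 and len(self.buffer) < self.max_window:
            return []  # keep waiting for more right-hand context

        cut = last_eos + 1 if last_eos >= 0 else len(self.buffer)
        segment = " ".join(w + lab for w, lab in zip(self.buffer[:cut], labels[:cut]))
        self.buffer = self.buffer[cut:]  # unfinalized tail slides forward
        return [segment]
```

In a live dictation session, `push` would be called on each partial ASR result; the unfinalized tail carries forward so the next window sees bidirectional context around it before its punctuation is committed.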
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model [84.12646619522774]
We show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous works in sequence generation tasks.
It achieves a remarkable 53% relative improvement in word error rate for ASR and a 27% relative improvement in F1 score for slot filling.
We also show the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR.
arXiv Detail & Related papers (2023-10-04T17:07:32Z)
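As a companion to the adapter-tuning result summarized above, here is a minimal sketch of a bottleneck adapter of the kind typically inserted into a frozen encoder-decoder such as Wav2Seq. The layer sizes and the name-based freezing rule are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual adapter inserted into a frozen Transformer layer.

    Only these few parameters are trained; the backbone (e.g. a
    self-supervised encoder-decoder such as Wav2Seq) stays frozen.
    Sizes here are illustrative, not the paper's configuration.
    """

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's behaviour as a default.
        return hidden + self.up(self.act(self.down(hidden)))


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Train only adapter (and prompt) parameters; freeze everything else.

    Assumes adapter/prompt modules carry those substrings in their names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name or "prompt" in name
```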
- Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation [0.08602553195689511]
We propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss.
By combining CTC losses on the chunks and utterances, we improve both the F1 score of punctuation prediction and the Word Error Rate (WER).
arXiv Detail & Related papers (2023-06-02T06:46:14Z)
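The chunk-plus-utterance CTC training summarized above can be sketched as a weighted sum of two CTC losses. The 0.5 weighting and the tensor shapes below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn.functional as F


def combined_ctc_loss(chunk_log_probs, chunk_targets, chunk_in_lens, chunk_tgt_lens,
                      utt_log_probs, utt_targets, utt_in_lens, utt_tgt_lens,
                      chunk_weight: float = 0.5):
    """Weighted sum of a chunk-level and an utterance-level CTC loss.

    The *_log_probs tensors are (T, N, C) log-softmax encoder outputs, as
    expected by torch.nn.functional.ctc_loss. The 0.5 weight is an
    illustrative choice, not a value reported in the paper.
    """
    loss_chunk = F.ctc_loss(chunk_log_probs, chunk_targets,
                            chunk_in_lens, chunk_tgt_lens, zero_infinity=True)
    loss_utt = F.ctc_loss(utt_log_probs, utt_targets,
                          utt_in_lens, utt_tgt_lens, zero_infinity=True)
    return chunk_weight * loss_chunk + (1.0 - chunk_weight) * loss_utt
```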
- Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR [35.750921748001275]
We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance.
This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence.
arXiv Detail & Related papers (2023-05-28T19:31:45Z) - Streaming Punctuation: A Novel Punctuation Technique Leveraging
- Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition [0.8670827427401333]
We propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2023-01-10T07:07:20Z) - Iterative pseudo-forced alignment by acoustic CTC loss for
self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling from specific domains is costly and time-consuming for humans.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- End to End ASR System with Automatic Punctuation Insertion [0.0]
We propose a method to generate punctuated transcript for the TEDLIUM dataset using transcripts available from ted.com.
We also propose an end-to-end ASR system that outputs words and punctuations concurrently from speech signals.
arXiv Detail & Related papers (2020-12-03T15:46:43Z)
- Robust Prediction of Punctuation and Truecasing for Medical ASR [18.08508027663331]
This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
arXiv Detail & Related papers (2020-07-04T07:15:13Z)
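The conditional joint modeling of punctuation and truecasing summarized above can be loosely sketched as a shared encoder with two heads, where the casing head is conditioned on the punctuation logits. The label sets, sizes, and conditioning scheme below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class PunctCaseTagger(nn.Module):
    """Two-head token tagger for punctuation and truecasing.

    Hidden states from any masked-LM-style encoder feed a punctuation head;
    the casing head also sees the punctuation logits, loosely mirroring the
    conditional joint modeling idea. Label inventories are illustrative.
    """

    def __init__(self, d_model: int = 768, n_punct: int = 4, n_case: int = 3):
        super().__init__()
        self.punct_head = nn.Linear(d_model, n_punct)           # e.g. none , . ?
        self.case_head = nn.Linear(d_model + n_punct, n_case)   # lower / Capitalized / ALLCAPS

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model) contextual embeddings
        punct_logits = self.punct_head(hidden)
        case_input = torch.cat([hidden, punct_logits], dim=-1)
        case_logits = self.case_head(case_input)
        return punct_logits, case_logits
```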