Streaming Punctuation: A Novel Punctuation Technique Leveraging
Bidirectional Context for Continuous Speech Recognition
- URL: http://arxiv.org/abs/2301.03819v1
- Date: Tue, 10 Jan 2023 07:07:20 GMT
- Title: Streaming Punctuation: A Novel Punctuation Technique Leveraging
Bidirectional Context for Continuous Speech Recognition
- Authors: Piyush Behre, Sharman Tan, Padma Varadharajan and Shuangyu Chang
- Abstract summary: We propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
- Score: 0.8670827427401333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While speech recognition Word Error Rate (WER) has reached human parity for
English, continuous speech recognition scenarios such as voice typing and
meeting transcriptions still suffer from segmentation and punctuation problems,
resulting from irregular pausing patterns or slow speakers. Transformer
sequence tagging models are effective at capturing long bi-directional context,
which is crucial for automatic punctuation. Automatic Speech Recognition (ASR)
production systems, however, are constrained by real-time requirements, making
it hard to incorporate the right context when making punctuation decisions.
Context within the segments produced by ASR decoders can be helpful, but it limits overall punctuation performance for a continuous speech session. In
this paper, we propose a streaming approach for punctuation or re-punctuation
of ASR output using dynamic decoding windows and measure its impact on
punctuation and segmentation accuracy across scenarios. The new system tackles
over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming
punctuation achieves an average BLEU score improvement of 0.66 for the
downstream task of Machine Translation (MT).
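The abstract describes the approach only at a high level, so the following is a minimal, hypothetical Python sketch of punctuation with a dynamic decoding window: tokens are buffered, a bidirectional tagger re-decodes the window as new tokens arrive, and a token's punctuation is committed only once enough right context has accumulated. The punctuate callback, the window sizes, and the commit rule are illustrative assumptions, not the authors' implementation. (A second sketch after the related-papers list shows how a segmentation F0.5-score is conventionally computed.)

```python
# Hypothetical sketch of streaming punctuation with a dynamic decoding window.
from collections import deque
from typing import Callable, Iterable, Iterator, List

LOOKAHEAD = 8    # assumed: right-context tokens required before committing
MAX_WINDOW = 64  # assumed: upper bound on the dynamic decoding window

def stream_punctuation(
    tokens: Iterable[str],
    punctuate: Callable[[List[str]], List[str]],  # one label per token: "", ",", ".", "?"
) -> Iterator[str]:
    """Re-punctuate an unsegmented ASR token stream.

    Buffering tokens in a sliding window lets the tagger see bidirectional
    context across ASR segment boundaries; a token's label is emitted only
    after LOOKAHEAD tokens of right context arrive, so premature
    end-of-segment decisions can still be revised.
    """
    window = deque(maxlen=MAX_WINDOW)
    committed = 0  # tokens already emitted from the left edge of the window

    for token in tokens:
        if len(window) == MAX_WINDOW:
            committed = max(0, committed - 1)  # oldest token slides off the left
        window.append(token)
        labels = punctuate(list(window))  # re-decode the whole window
        while committed < len(window) - LOOKAHEAD:  # enough right context?
            yield window[committed] + labels[committed]
            committed += 1

    if window:  # end of stream: flush the tail with whatever context exists
        labels = punctuate(list(window))
        for i in range(committed, len(window)):
            yield window[i] + labels[i]
```

In practice the punctuate callback would wrap a Transformer token classifier; LOOKAHEAD then becomes the knob that trades added latency for punctuation and segmentation accuracy.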
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation [0.08602553195689511]
We propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss.
By combining CTC losses at the chunk and utterance levels, we improve both the punctuation-prediction F1 score and the Word Error Rate (WER).
arXiv Detail & Related papers (2023-06-02T06:46:14Z)
- Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR [35.750921748001275]
We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance.
This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence.
arXiv Detail & Related papers (2023-05-28T19:31:45Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation F0.5-score by 9.8% over the baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Streaming Punctuation for Long-form Dictation with Transformers [0.8670827427401333]
Streaming punctuation achieves an average BLEU score gain of 0.66 for the downstream task of Machine Translation.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2022-10-11T20:03:03Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the Corpus of Spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Robust Prediction of Punctuation and Truecasing for Medical ASR [18.08508027663331]
This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
arXiv Detail & Related papers (2020-07-04T07:15:13Z)
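Both the headline system and the acousto-linguistic segmenter above report gains in segmentation F0.5-score. As a reference point, this is the standard F-beta computation with beta = 0.5, which weights precision twice as heavily as recall and therefore penalizes exactly the over-segmentation these papers target; the exact-position boundary matching is an assumption for illustration.

```python
# Standard F-beta (beta = 0.5) over sentence-boundary positions.
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

def segmentation_f05(pred: set, ref: set) -> float:
    """F0.5 of predicted vs. reference sentence-boundary token positions."""
    if not pred or not ref:
        return 0.0
    true_pos = len(pred & ref)
    return f_beta(true_pos / len(pred), true_pos / len(ref), beta=0.5)

# An over-segmenting system: 3 reference boundaries, 5 predicted, all 3 found.
# precision = 0.6, recall = 1.0 -> 1.25 * 0.6 / (0.25 * 0.6 + 1.0) ~= 0.652
print(segmentation_f05({10, 18, 25, 33, 40}, {10, 25, 40}))
```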