End to End ASR System with Automatic Punctuation Insertion
- URL: http://arxiv.org/abs/2012.02012v1
- Date: Thu, 3 Dec 2020 15:46:43 GMT
- Title: End to End ASR System with Automatic Punctuation Insertion
- Authors: Yushi Guan
- Abstract summary: We propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from ted.com.
We also propose an end-to-end ASR system that outputs words and punctuation marks concurrently from speech signals.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Automatic Speech Recognition systems have been moving towards
end-to-end systems that can be trained together. Numerous recently proposed
techniques have enabled this trend, including feature extraction with CNNs,
context capturing and acoustic feature modeling with RNNs, automatic alignment
of input and output sequences using Connectionist Temporal Classification, and
the replacement of traditional n-gram language models with RNN language models.
Historically, there has been considerable interest in automatic punctuation in
textual or speech-to-text contexts. However, there seems to be little interest
in incorporating automatic punctuation into the emerging neural network based
end-to-end speech recognition systems, partly due to the lack of English speech
corpora with punctuated transcripts. In this study, we propose a method to
generate punctuated transcripts for the TEDLIUM dataset using transcripts
available from ted.com. We also propose an end-to-end ASR system that outputs
words and punctuation marks concurrently from speech signals. By combining the
Damerau-Levenshtein distance and the slot error rate into DLev-SER, we enable
measurement of the punctuation error rate even when the hypothesis text is not
perfectly aligned with the reference. Compared with previous methods, our model
reduces the slot error rate from 0.497 to 0.341.
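The exact DLev-SER formulation is not given in this summary, so the following is a minimal, hypothetical sketch of the idea it describes: compute the Damerau-Levenshtein distance (here the optimal string alignment variant) between the punctuation slots of the reference and the hypothesis, then normalize by the number of reference slots. The `PUNCT` set and the slot-extraction step are illustrative assumptions, not the paper's definitions.

```python
def damerau_levenshtein(a, b):
    # Optimal string alignment distance: insertions, deletions,
    # substitutions, and transpositions of adjacent elements.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent elements
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


# Illustrative punctuation inventory (an assumption, not the paper's).
PUNCT = {",", ".", "?", "!"}


def punctuation_slot_error_rate(ref_tokens, hyp_tokens):
    # Simplified slot scoring: extract punctuation slots from each
    # token sequence and divide their edit distance by the number of
    # reference slots. The paper's DLev-SER may additionally anchor
    # slots to the surrounding word alignment.
    ref_slots = [t for t in ref_tokens if t in PUNCT]
    hyp_slots = [t for t in hyp_tokens if t in PUNCT]
    errors = damerau_levenshtein(ref_slots, hyp_slots)
    return errors / max(len(ref_slots), 1)
```

For example, with reference tokens `["hello", ",", "world", "."]` and hypothesis tokens `["hello", "world", "."]`, one of the two reference slots is missing, giving a slot error rate of 0.5.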
Related papers
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs)
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40ms relative.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep-learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
arXiv Detail & Related papers (2021-05-12T11:36:01Z) - On Addressing Practical Challenges for RNN-Transduce [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z) - Contextual Biasing of Language Models for Speech Recognition in
Goal-Oriented Conversational Agents [11.193867567895353]
Goal-oriented conversational interfaces are designed to accomplish specific tasks.
We propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided during inference time.
Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
arXiv Detail & Related papers (2021-03-18T15:38:08Z) - WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal
Classification Paradigm [0.0]
We propose a new balanced paradigm for e-WER in a classification setting.
Within this paradigm, we also propose WER-BERT, a BERT based architecture with speech features for e-WER.
The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.
arXiv Detail & Related papers (2021-01-14T07:26:28Z) - Replacing Human Audio with Synthetic Audio for On-device Unspoken
Punctuation Prediction [10.516452073178511]
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features.
We demonstrate for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem.
arXiv Detail & Related papers (2020-10-20T11:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.