End-to-end Speech-to-Punctuated-Text Recognition
- URL: http://arxiv.org/abs/2207.03169v1
- Date: Thu, 7 Jul 2022 08:58:01 GMT
- Title: End-to-end Speech-to-Punctuated-Text Recognition
- Authors: Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto
- Abstract summary: Punctuation marks are important for the readability of speech recognition results.
Conventional automatic speech recognition systems do not produce punctuation marks.
We propose an end-to-end model that takes speech as input and outputs punctuated texts.
- Score: 23.44236710364419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional automatic speech recognition systems do not produce
punctuation marks, which are important for the readability of speech
recognition results. They are also needed for subsequent natural language
processing tasks such as machine translation. Many studies have proposed
punctuation prediction models that insert punctuation marks into speech
recognition results as a post-processing step. However, these models do not
utilize acoustic information
for punctuation prediction and are directly affected by speech recognition
errors. In this study, we propose an end-to-end model that takes speech as
input and outputs punctuated texts. This model is expected to predict
punctuation robustly against speech recognition errors while using acoustic
information. We also propose to incorporate an auxiliary loss that trains the
output of an intermediate layer against the unpunctuated text.
Through experiments, we compare the performance of the proposed model to that
of a cascaded system. The proposed model achieves higher punctuation prediction
accuracy than the cascaded system without sacrificing the speech recognition
error rate. It is also demonstrated that multi-task learning that trains the
intermediate output against the unpunctuated text is effective. Moreover, the
proposed model has only about one-seventh as many parameters as the cascaded
system.
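
The auxiliary intermediate loss described in the abstract can be sketched as a multi-task objective: the final encoder output is trained against punctuated text, while the output of an intermediate layer is trained against the unpunctuated transcript. The following is a minimal sketch, not the authors' implementation; the CTC criterion, the tapped layer index, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechToPunctuatedText(nn.Module):
    """Sketch of an encoder with two objectives: punctuated text at the
    top layer, unpunctuated text at an intermediate layer (auxiliary)."""

    def __init__(self, feat_dim=80, d_model=256, n_layers=12,
                 tap_layer=6, vocab_size=5000):
        super().__init__()
        self.tap_layer = tap_layer
        self.proj_in = nn.Linear(feat_dim, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.head_punct = nn.Linear(d_model, vocab_size)  # punctuated tokens
        self.head_plain = nn.Linear(d_model, vocab_size)  # unpunctuated tokens
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, punct_ys, punct_lens,
                plain_ys, plain_lens, aux_weight=0.3):
        x = self.proj_in(feats)                  # (B, T, d_model)
        inter = None
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == self.tap_layer:
                inter = x                        # tap the intermediate output

        def ctc_loss(head, hidden, ys, y_lens):
            # (T, B, V) log-probs; assumes no temporal subsampling,
            # so the input lengths equal the feature lengths.
            logp = head(hidden).log_softmax(-1).transpose(0, 1)
            return self.ctc(logp, ys, feat_lens, y_lens)

        main = ctc_loss(self.head_punct, x, punct_ys, punct_lens)
        aux = ctc_loss(self.head_plain, inter, plain_ys, plain_lens)
        return main + aux_weight * aux           # multi-task objective
```

In practice the encoder would subsample in time and the two heads could share a vocabulary that differs only in punctuation symbols; those details are omitted here.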
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
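
The de-duplication step mentioned above is simple to illustrate: consecutive frames are often quantized to the same discrete unit, so collapsing runs of identical unit IDs shortens the sequence. A small sketch (the unit IDs are made up):

```python
from itertools import groupby

def deduplicate(units):
    """Collapse runs of identical discrete speech units,
    e.g. [5, 5, 5, 12, 12, 7] -> [5, 12, 7]."""
    return [u for u, _ in groupby(units)]

# Ten frames quantized to (hypothetical) cluster IDs:
frames = [5, 5, 5, 12, 12, 12, 12, 7, 7, 3]
print(deduplicate(frames))  # -> [5, 12, 7, 3]
```

Subword modeling (e.g., BPE over the de-duplicated unit strings) can then merge frequent unit n-grams to compress the sequence further.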
- Careful Whisper -- leveraging advances in automatic speech recognition for robust and interpretable aphasia subtype classification [0.0]
This paper presents a fully automated approach for identifying speech anomalies from voice recordings to aid in the assessment of speech impairments.
By combining Connectionist Temporal Classification (CTC) and encoder-decoder-based automatic speech recognition models, we generate rich acoustic and clean transcripts.
We then apply several natural language processing methods to extract features from these transcripts to produce prototypes of healthy speech.
arXiv Detail & Related papers (2023-08-02T15:53:59Z)
- Improved Training for End-to-End Streaming Automatic Speech Recognition
Model with Punctuation [0.08602553195689511]
We propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss.
By combining CTC losses on the chunks and the whole utterances, we achieve improvements in both the F1 score of punctuation prediction and the Word Error Rate (WER).
arXiv Detail & Related papers (2023-06-02T06:46:14Z)
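
One plausible reading of "combining CTC losses on the chunks and utterances" is a weighted sum of a full-utterance CTC loss and per-chunk CTC losses. The summary does not specify how targets are assigned to chunks, so the chunk-target inputs below are assumptions, not the paper's method:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def combined_ctc(logits, utt_ys, utt_lens, chunk_size,
                 chunk_ys, chunk_y_lens, alpha=0.5):
    """Hypothetical weighted sum of utterance- and chunk-level CTC.

    logits: (B, T, V) encoder outputs. chunk_ys[i] / chunk_y_lens[i]
    hold the (assumed) partial targets aligned to the i-th chunk.
    """
    B, T, _ = logits.shape
    logp = logits.log_softmax(-1)
    full_lens = torch.full((B,), T, dtype=torch.long)
    utt_loss = ctc(logp.transpose(0, 1), utt_ys, full_lens, utt_lens)
    chunk_losses = []
    for i, start in enumerate(range(0, T, chunk_size)):
        piece = logp[:, start:start + chunk_size].transpose(0, 1)
        lens = torch.full((B,), piece.size(0), dtype=torch.long)
        chunk_losses.append(ctc(piece, chunk_ys[i], lens, chunk_y_lens[i]))
    return alpha * utt_loss + (1 - alpha) * torch.stack(chunk_losses).mean()
```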
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
- Token-Level Supervised Contrastive Learning for Punctuation Restoration [7.9713449581347104]
Punctuation is critical in understanding natural language text.
Most automatic speech recognition systems do not generate punctuation.
Recent work in punctuation restoration heavily utilizes pre-trained language models.
arXiv Detail & Related papers (2021-07-19T18:24:33Z)
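
Token-level supervised contrastive learning can be sketched with the standard supervised contrastive objective (Khosla et al.) applied per token: token embeddings that share a punctuation label attract each other, all others repel. The paper's exact formulation may differ; the temperature and label scheme here are assumptions:

```python
import torch
import torch.nn.functional as F

def token_supcon_loss(embs, labels, temperature=0.1):
    """Supervised contrastive loss over token embeddings.

    embs:   (N, D) token representations from the encoder.
    labels: (N,)   punctuation class per token (e.g. NONE/COMMA/PERIOD).
    Tokens sharing a label are treated as positives for each other.
    """
    z = F.normalize(embs, dim=-1)
    sim = z @ z.t() / temperature                   # (N, N) cosine / temp
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf')) # exclude self-pairs
    logp = sim.log_softmax(dim=1)                   # over all other tokens
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    n_pos = pos.sum(dim=1)
    valid = n_pos > 0                               # anchors with a positive
    per_anchor = logp.masked_fill(~pos, 0.0).sum(dim=1)[valid] / n_pos[valid]
    return -per_anchor.mean()
```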
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and
language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
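
A common way to combine such embeddings, and a plausible sketch of this setup, is late fusion: concatenate an utterance-level acoustic embedding with a pooled language-model embedding and feed an MLP classifier. The dimensions and the concatenation choice below are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FusionIntentClassifier(nn.Module):
    """Late-fusion sketch: pretrained ASR-encoder embedding +
    pretrained LM embedding -> intent logits."""

    def __init__(self, acoustic_dim=512, text_dim=768,
                 hidden=256, n_intents=31):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_intents))

    def forward(self, acoustic_emb, text_emb):
        # Both embeddings are assumed mean-pooled over time already.
        return self.mlp(torch.cat([acoustic_emb, text_emb], dim=-1))

# Usage with random stand-ins for the pretrained embeddings:
clf = FusionIntentClassifier()
logits = clf(torch.randn(4, 512), torch.randn(4, 768))  # (4, n_intents)
```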
- End to End ASR System with Automatic Punctuation Insertion [0.0]
We propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from ted.com.
We also propose an end-to-end ASR system that outputs words and punctuations concurrently from speech signals.
arXiv Detail & Related papers (2020-12-03T15:46:43Z)
- Replacing Human Audio with Synthetic Audio for On-device Unspoken
Punctuation Prediction [10.516452073178511]
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features.
We demonstrate for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem.
arXiv Detail & Related papers (2020-10-20T11:30:26Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)