Improving Readability for Automatic Speech Recognition Transcription
- URL: http://arxiv.org/abs/2004.04438v1
- Date: Thu, 9 Apr 2020 09:26:42 GMT
- Title: Improving Readability for Automatic Speech Recognition Transcription
- Authors: Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi, Ming Gong, Linjun Shou, Hong Qu, Michael Zeng
- Abstract summary: We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
- Score: 50.86019112545596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Automatic Speech Recognition (ASR) systems can achieve high
performance in terms of recognition accuracy. However, a perfectly accurate
transcript can still be challenging to read due to grammatical errors,
disfluency, and other errata common in spoken communication. Many downstream
tasks and human readers rely on the output of the ASR system; therefore, errors
introduced by the speaker and ASR system alike will be propagated to the next
task in the pipeline. In this work, we propose a novel NLP task called ASR
post-processing for readability (APR) that aims to transform the noisy ASR
output into a readable text for humans and downstream tasks while maintaining
the semantic meaning of the speaker. In addition, we describe a method to
address the lack of task-specific data by synthesizing examples for the APR
task using the datasets collected for Grammatical Error Correction (GEC)
followed by text-to-speech (TTS) and ASR. Furthermore, we propose metrics
borrowed from similar tasks to evaluate performance on the APR task. We compare
fine-tuned models based on several open-sourced and adapted pre-trained models
with the traditional pipeline method. Our results suggest that fine-tuned models
significantly improve performance on the APR task, hinting at the potential
benefits of using APR systems. We hope that the read, understand, and rewrite
approach of our work can serve as a basis from which many NLP tasks and human
readers can benefit.
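The data-synthesis method described in the abstract (GEC targets → TTS → ASR) can be sketched as below. This is a minimal illustration, not the paper's implementation: `tts` and `asr` are placeholders for real systems, and the toy stand-ins only mimic the casing/punctuation loss a real round trip would introduce.

```python
def synthesize_apr_pairs(gec_targets, tts, asr):
    """Build (noisy ASR hypothesis, readable target) pairs for APR training.

    gec_targets: grammatically correct sentences, e.g. the corrected side
    of a GEC dataset. tts and asr are callables standing in for real
    text-to-speech and speech-recognition systems (hypothetical API).
    """
    pairs = []
    for target in gec_targets:
        audio = tts(target)   # speak the clean sentence
        noisy = asr(audio)    # transcribe it back, reintroducing ASR-style noise
        pairs.append((noisy, target))
    return pairs

# Toy stand-ins: a real TTS->ASR round trip would also substitute words;
# here we only drop casing and punctuation.
toy_tts = lambda text: text  # pretend the "audio" is the text itself
toy_asr = lambda audio: " ".join(w.strip(".,?!").lower() for w in audio.split())

pairs = synthesize_apr_pairs(
    ["He has quickly adapted to the new team."], toy_tts, toy_asr)
print(pairs[0][0])  # noisy input the APR model must rewrite
print(pairs[0][1])  # readable target it should produce
```

Pairing the noisy hypothesis with the original clean sentence gives supervised training data for APR without any manual transcription effort.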
Related papers
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR [3.717584661565119]
TokenVerse is a single Transducer-based model designed to handle multiple tasks.
It is achieved by integrating task-specific tokens into the reference text during ASR model training.
Our experiments show that the proposed method improves ASR by up to 7.7% in relative WER.
arXiv Detail & Related papers (2024-07-05T11:54:38Z)
- Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously downgrade the performance of the NLP modules.
Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
- Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
- Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model [18.26945997660616]
Many downstream tasks and human readers rely on the output of the ASR system.
We propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into a readable text.
arXiv Detail & Related papers (2021-02-22T15:45:50Z)
- Knowledge Distillation for Improved Accuracy in Spoken Question Answering [63.72278693825945]
We devise a training strategy to perform knowledge distillation from spoken documents and written counterparts.
Our work makes a step towards distilling knowledge from the language model as a supervision signal.
Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset.
arXiv Detail & Related papers (2020-10-21T15:18:01Z)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
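As a rough illustration of the joint setup above, the two task losses can be combined into one objective over a shared encoding of the ASR hypothesis. All names, the mixing weight `lam`, and the toy head functions below are illustrative assumptions, not the paper's actual architecture.

```python
def joint_loss(correction_loss, lu_loss, lam=0.7):
    """Weighted multi-task objective; lam is an illustrative mixing weight."""
    return lam * correction_loss + (1.0 - lam) * lu_loss

def training_step(encode, correct, understand, asr_tokens, gold_text, gold_intent):
    # Both heads read the same shared encoding of the (possibly errorful)
    # ASR hypothesis, so gradients from the LU head also shape the
    # representation used for correction.
    h = encode(asr_tokens)              # shared contextual encoding
    l_corr = correct(h, gold_text)      # correction (seq2seq) loss
    l_lu = understand(h, gold_intent)   # language-understanding loss
    return joint_loss(l_corr, l_lu)

# Toy stand-ins so the step runs end to end:
encode = lambda toks: len(toks)     # fake "encoding"
correct = lambda h, gold: 1.0       # fake correction loss
understand = lambda h, gold: 0.0    # fake LU loss
loss = training_step(encode, correct, understand,
                     ["play", "som", "music"], "play some music", "PlayMusic")
print(loss)  # lam * l_corr + (1 - lam) * l_lu
```

Training both heads against one objective is what lets in-domain LU supervision improve the correction of ASR errors, and vice versa.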
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.