Human Transcription Quality Improvement
- URL: http://arxiv.org/abs/2309.14372v1
- Date: Sun, 24 Sep 2023 03:39:43 GMT
- Title: Human Transcription Quality Improvement
- Authors: Jian Gao, Hanbo Sun, Cheng Cao, Zheng Du
- Abstract summary: We introduce two mechanisms to improve transcription quality: confidence-estimation-based reprocessing at the labeling stage, and automatic word error correction at the post-labeling stage.
We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech.
- Score: 2.24166568188073
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: High-quality transcription data is crucial for training automatic speech
recognition (ASR) systems. However, existing industry-level data collection
pipelines are expensive for researchers, while the quality of crowdsourced
transcription is low. In this paper, we propose a reliable method to collect
speech transcriptions. We introduce two mechanisms to improve transcription
quality: confidence-estimation-based reprocessing at the labeling stage, and
automatic word error correction at the post-labeling stage. We collect and release
LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100
hours of English speech. Experiments show that transcription WER is reduced by
over 50%. We further investigated the impact of transcription errors on ASR model
performance and found a strong correlation. The transcription quality
improvement provides over 10% relative WER reduction for ASR models. We release
the dataset and code to benefit the research community.
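To make the two mechanisms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' released code: the agreement-based confidence heuristic, the reprocessing threshold, and the positional majority vote are all hypothetical stand-ins for the paper's actual confidence estimator and word error correction.

```python
# Minimal sketch of the two quality-improvement mechanisms from the abstract.
# All names, heuristics, and thresholds here are hypothetical illustrations.
from collections import Counter

def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def confidence(transcripts: list[list[str]]) -> float:
    """Labeling-stage confidence: mean pairwise agreement (1 - WER)
    among independent crowd transcriptions of the same audio clip."""
    pairs = [(a, b) for i, a in enumerate(transcripts) for b in transcripts[i + 1:]]
    return sum(1.0 - wer(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def needs_reprocessing(transcripts: list[list[str]], threshold: float = 0.8) -> bool:
    """Mechanism 1: route low-confidence clips back for another labeling round."""
    return confidence(transcripts) < threshold

def merge(transcripts: list[list[str]]) -> list[str]:
    """Mechanism 2 (post-labeling): naive positional word-level majority vote.
    Assumes equal-length transcripts; real error correction must align first."""
    return [Counter(ws).most_common(1)[0][0] for ws in zip(*transcripts)]

hyps = [["the", "weather", "is", "nice"],
        ["the", "wether", "is", "nice"],
        ["the", "weather", "is", "mice"]]
print(needs_reprocessing(hyps), merge(hyps))
```

The example prints `True ['the', 'weather', 'is', 'nice']`: pairwise agreement (about 0.67) falls below the hypothetical 0.8 threshold, so the clip would be flagged for relabeling, while the vote already repairs both single-word errors.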
Related papers
- Measuring the Accuracy of Automatic Speech Recognition Solutions [4.99320937849508]
Automatic Speech Recognition (ASR) is now a part of many popular applications.
We measured the performance of eleven common ASR services on recordings of higher education lectures.
Our results show that accuracy varies widely between vendors and across individual audio samples.
We also measured significantly lower quality for streaming ASR, which is used for live events.
arXiv Detail & Related papers (2024-08-29T06:38:55Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- HTEC: Human Transcription Error Correction [4.241671683889168]
High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models.
We propose HTEC for Human Transcription Error Correction.
HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills masked positions.
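(A minimal code sketch of this mask-and-fill design appears after this list.)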
arXiv Detail & Related papers (2023-09-18T19:03:21Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses [6.053166856632848]
Alzheimer's Disease (AD) is the world's leading neurodegenerative disease.
The recent ADReSS challenge provided a dataset for AD classification.
We used the state-of-the-art Automatic Speech Recognition (ASR) model Whisper to obtain the transcriptions.
arXiv Detail & Related papers (2023-06-06T06:49:41Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers [6.017182111335404]
Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems.
We show that the Transducer system trained using transcripts produced by the hybrid system achieves an 18% reduction in word error rate.
arXiv Detail & Related papers (2023-05-23T03:50:35Z)
- ASR Error Detection via Audio-Transcript Entailment [1.3750624267664155]
We propose an end-to-end approach for ASR error detection using audio-transcript entailment.
The proposed model utilizes an acoustic encoder and a linguistic encoder to model the speech and the transcript, respectively.
Our proposed model achieves classification error rates (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, leading to improvements upon a strong baseline by 12% and 15.4%, respectively.
arXiv Detail & Related papers (2022-07-22T02:47:15Z)
- Textual Supervision for Visually Grounded Spoken Language Understanding [51.93744335044475]
Visually-grounded models of spoken language understanding extract semantic information directly from speech.
This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain.
Recent work showed that these models can be improved if transcriptions are available at training time.
arXiv Detail & Related papers (2020-10-06T15:16:23Z)
- Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language.
We show that training ST with human translations is not necessary.
Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to an 8.9% WER reduction over direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z)
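As noted in the HTEC entry above, its two-stage design (detect-and-mask, then fill) lends itself to a short illustration. The sketch below is a hypothetical stand-in, not HTEC itself: a toy out-of-vocabulary detector takes the place of Trans-Checker, and an off-the-shelf fill-mask model (bert-base-uncased via Hugging Face transformers, an assumption of this sketch) takes the place of the trained Trans-Filler.

```python
# Hypothetical mask-and-fill corrector in the spirit of HTEC's
# Trans-Checker + Trans-Filler; the OOV detector and the off-the-shelf
# masked LM are illustrative stand-ins for the paper's trained models.
from transformers import pipeline

filler = pipeline("fill-mask", model="bert-base-uncased")

def detect_errors(words: list[str], vocab: set[str]) -> list[int]:
    """Toy 'Trans-Checker': flag out-of-vocabulary words as suspect."""
    return [i for i, w in enumerate(words) if w.lower() not in vocab]

def correct(transcript: str, vocab: set[str]) -> str:
    """Mask each suspect word, then let the masked LM propose a replacement."""
    words = transcript.split()
    for i in detect_errors(words, vocab):
        masked = words[:i] + [filler.tokenizer.mask_token] + words[i + 1:]
        words[i] = filler(" ".join(masked))[0]["token_str"].strip()
    return " ".join(words)

vocab = {"the", "weather", "is", "nice", "today"}
print(correct("the wether is nice today", vocab))  # e.g. "the weather is nice today"
```

The sketch glosses over the hard part: a vocabulary filter misses real-word errors (e.g. "there" for "their"), which is why HTEC trains a dedicated detection model, and its Trans-Filler is a sequence-to-sequence generator rather than a single-token masked LM.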
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.