DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in
Indo-European Languages
- URL: http://arxiv.org/abs/2310.16749v1
- Date: Wed, 25 Oct 2023 16:32:02 GMT
- Title: DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in
Indo-European Languages
- Authors: Vineet Bhat, Preethi Jyothi, Pushpak Bhattacharyya
- Abstract summary: Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text.
We present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French.
- We show that DC leads to a 5.65-point increase in BLEU scores on average when used in conjunction with a state-of-the-art Machine Translation (MT) system.
- Score: 68.66827612799577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Disfluency correction (DC) is the process of removing disfluent elements like
fillers, repetitions and corrections from spoken utterances to create readable
and interpretable text. DC is a vital post-processing step applied to Automatic
Speech Recognition (ASR) outputs, before subsequent processing by downstream
language understanding tasks. Existing DC research has primarily focused on
English due to the unavailability of large-scale open-source datasets. Towards
the goal of multilingual disfluency correction, we present a high-quality
human-annotated DC corpus covering four important Indo-European languages:
English, Hindi, German and French. We provide extensive analysis of results of
state-of-the-art DC models across all four languages, obtaining F1 scores of
97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To
demonstrate the benefits of DC on downstream tasks, we show that DC leads to a
5.65-point increase in BLEU scores on average when used in conjunction with a
state-of-the-art Machine Translation (MT) system. We release code to run our
experiments along with our annotated dataset here.
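As a concrete, toy illustration of the task (the paper's DC models are learned sequence taggers, not hand-written rules), the sketch below strips common English fillers, collapses immediate word repetitions, and compares the raw and corrected strings against a fluent reference with sacrebleu. The filler list and example sentences are invented for illustration; in the paper, the BLEU gain is measured on MT output of corrected versus raw transcripts.

```python
import re

import sacrebleu

# Illustrative single-word English fillers; the released corpus covers
# more disfluency types (corrections, false starts) and four languages.
FILLERS = {"uh", "um", "erm", "hmm"}

def remove_disfluencies(utterance: str) -> str:
    """Toy rule-based disfluency correction: drop filler words and
    collapse immediate word repetitions ("I I think" -> "I think")."""
    out = []
    for tok in utterance.split():
        word = re.sub(r"[^\w']", "", tok.lower())
        if word in FILLERS:
            continue  # skip filler words
        if out and word == re.sub(r"[^\w']", "", out[-1].lower()):
            continue  # skip an immediate repetition of the previous word
        out.append(tok)
    return " ".join(out)

# Invented example: a disfluent ASR-style transcript and a fluent reference.
raw = "I I think uh we should should leave now"
refs = [["I think we should leave now"]]

corrected = remove_disfluencies(raw)
print(corrected)  # -> "I think we should leave now"

# BLEU of the raw vs. corrected string against the fluent reference.
print(sacrebleu.corpus_bleu([raw], refs).score)        # lower
print(sacrebleu.corpus_bleu([corrected], refs).score)  # 100.0 here
```

A real system would replace the rule set with a trained tagger such as the paper's models; the before/after BLEU comparison pattern stays the same.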
Related papers
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages (a toy sketch of such data generation appears after this list).
Our technique also performs well in removing stuttering disfluencies introduced into ASR transcripts by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Congolese Swahili Machine Translation for Humanitarian Response [0.05526111147542002]
We describe our efforts to build a bidirectional Congolese Swahili-French neural machine translation system.
For training, we created a 25,302-sentence general-domain parallel corpus.
We recorded improvements of up to 2.4 and 3.5 BLEU points in the SWC-FRA and FRA-SWC directions.
arXiv Detail & Related papers (2021-03-19T11:15:48Z)
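The adversarial-training entry above crucially depends on synthetically generated disfluent data. Below is a minimal sketch of one plausible recipe (an illustrative assumption, not that paper's actual generator): randomly inject fillers and word repetitions into fluent sentences, yielding (disfluent, fluent) pairs on which a sequence-tagging DC model could be trained.

```python
import random

# Hypothetical filler inventory; a real pipeline would use
# language-specific fillers for each target language.
FILLERS = ["uh", "um", "well"]

def make_disfluent(sentence, p_filler=0.3, p_repeat=0.3, seed=None):
    """Return a synthetic disfluent version of a fluent sentence by
    randomly inserting fillers and duplicating words. Pairing the
    output with its input yields silver training data for a DC tagger."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))  # insert a filler before the word
        out.append(tok)
        if rng.random() < p_repeat:
            out.append(tok)  # immediate word repetition
    return " ".join(out)

fluent = "we should leave now"
print(make_disfluent(fluent, seed=0))  # output varies with seed
```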