Artificial Disfluency Detection, Uh No, Disfluency Generation for the
Masses
- URL: http://arxiv.org/abs/2211.09235v1
- Date: Wed, 16 Nov 2022 22:00:02 GMT
- Title: Artificial Disfluency Detection, Uh No, Disfluency Generation for the
Masses
- Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos and S.
Vrochidis
- Abstract summary: This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text.
LARD can simulate all the different types of disfluencies (repetitions, replacements and restarts) based on the reparandum/interregnum annotation scheme.
Since the proposed method requires only fluent text, it can be used directly for training, bypassing the requirement of annotated disfluent data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing approaches for disfluency detection typically require the existence
of large annotated datasets. However, current datasets for this task are
limited, suffer from class imbalance, and lack some types of disfluencies that
can be encountered in real-world scenarios. This work proposes LARD, a method
for automatically generating artificial disfluencies from fluent text. LARD can
simulate all the different types of disfluencies (repetitions, replacements and
restarts) based on the reparandum/interregnum annotation scheme. In addition,
it incorporates contextual embeddings into the disfluency generation to produce
realistic context-aware artificial disfluencies. Since the proposed method
requires only fluent text, it can be used directly for training, bypassing the
requirement of annotated disfluent data. Our empirical evaluation demonstrates
that LARD can be used effectively when little or no annotated data is
available. Furthermore, our detailed analysis suggests that the proposed method
generates realistic disfluencies and increases the accuracy of existing
disfluency detectors.
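The abstract describes the three disfluency types only at a high level. As a rough illustration of how such transformations could work, the hedged Python sketch below injects repetitions, replacements, and restarts into fluent token sequences; all names are illustrative, and where LARD selects context-aware replacements with contextual embeddings, this sketch simply samples from a user-supplied candidate list.

```python
import random

FILLERS = ["uh", "um", "I mean", "you know"]  # common interregnum tokens

def make_repetition(tokens, max_span=3):
    """Repetition: repeat a short span, e.g. 'the red ball' ->
    'the red the red ball' (the first copy is the reparandum)."""
    start = random.randrange(len(tokens))
    span = random.randint(1, min(max_span, len(tokens) - start))
    return tokens[:start] + tokens[start:start + span] + tokens[start:]

def make_replacement(tokens, candidates):
    """Replacement: insert a wrong word plus an interregnum before the
    intended word, e.g. 'ball' -> 'bat, uh, ball'. LARD picks the wrong
    word with contextual embeddings; here we sample from `candidates`."""
    i = random.randrange(len(tokens))
    return tokens[:i] + [random.choice(candidates) + ",",
                         random.choice(FILLERS) + ","] + tokens[i:]

def make_restart(tokens, abandoned, fragment_len=3):
    """Restart: an abandoned fragment of another utterance, an
    interregnum, then the full fluent sentence."""
    return abandoned[:fragment_len] + [random.choice(FILLERS) + ","] + tokens

fluent = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(make_repetition(fluent)))
print(" ".join(make_replacement(fluent, ["cat", "slow", "table"])))
print(" ".join(make_restart(fluent, "i saw a bird by the fence".split())))
```

Because every transformation starts from a fluent sentence, the positions of the inserted reparandum and interregnum are known exactly, so token-level labels for training a detector come for free.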
Related papers
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection [5.42845980208244]
YOLO-Stutter is the first end-to-end method that detects dysfluencies in a time-accurate manner.
VCTK-Stutter and VCTK-TTS simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation.
arXiv Detail & Related papers (2024-08-27T11:31:12Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- Boosting Disfluency Detection with Large Language Model as Disfluency Generator [8.836888435915077]
We propose a lightweight data augmentation approach for disfluency detection.
We leverage a large language model (LLM) to generate disfluent sentences as augmentation data.
We apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences (a hypothetical sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-03-13T04:14:33Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when the models are deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- An Invariant Learning Characterization of Controlled Text Generation [25.033675230270212]
Controlled generation refers to the problem of creating text that contains stylistic or semantic attributes of interest.
We show that the performance of controlled generation may be poor if the distributions of text in response to user prompts differ from the distribution the predictor was trained on.
arXiv Detail & Related papers (2023-05-31T21:35:08Z)
- LARD: Large-scale Artificial Disfluency Generation [0.0]
We propose LARD, a method for generating complex and realistic artificial disfluencies with little effort.
The proposed method can handle three of the most common types of disfluencies: repetitions, replacements and restarts.
We release a new large-scale dataset with disfluencies that can be used on four different tasks.
arXiv Detail & Related papers (2022-01-13T16:02:36Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also enhances robustness, i.e., it produces high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose a task to predict whether each token in the output sequence is hallucinated, i.e., not contained in the input.
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data.
arXiv Detail & Related papers (2020-11-05T00:18:53Z)
- Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning [80.20302993614594]
We provide a statistical analysis to overcome the drawbacks of Laplacian regularization.
We unveil a large body of spectral filtering methods that exhibit desirable behaviors.
We provide realistic computational guidelines in order to make our method usable with large amounts of data.
arXiv Detail & Related papers (2020-09-09T14:28:54Z)
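As noted in the Boosting Disfluency Detection entry above, that paper filters LLM-generated training sentences with an uncertainty-aware criterion, described only at a high level. The Python sketch below shows one plausible reading, under stated assumptions: a generated sentence is kept only when a detector's mean token-level predictive entropy stays below a threshold. The detector interface, the threshold value, and the stub detector are all illustrative, not the paper's actual procedure.

```python
import math

def token_entropy(prob_disfluent):
    """Binary predictive entropy (in nats) for one token."""
    p = min(max(prob_disfluent, 1e-9), 1 - 1e-9)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def filter_generated(samples, detector, max_mean_entropy=0.4):
    """Keep a generated sentence only if the detector labels its tokens
    confidently on average. `detector` maps a token list to per-token
    P(disfluent); both it and the threshold are assumptions."""
    kept = []
    for tokens in samples:
        probs = detector(tokens)
        mean_h = sum(token_entropy(p) for p in probs) / len(probs)
        if mean_h <= max_mean_entropy:
            kept.append(tokens)
    return kept

# Toy usage: a stub detector that is confident about seen tokens and
# maximally uncertain (p = 0.5) about everything else.
def stub_detector(tokens):
    known = {"uh": 0.95, "um": 0.95, "i": 0.05, "want": 0.05, "tea": 0.05}
    return [known.get(t, 0.5) for t in tokens]

data = [["i", "want", "uh", "tea"], ["totally", "ambiguous", "tokens"]]
print(filter_generated(data, stub_detector))  # keeps only the first sample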