LARD: Large-scale Artificial Disfluency Generation
- URL: http://arxiv.org/abs/2201.05041v1
- Date: Thu, 13 Jan 2022 16:02:36 GMT
- Title: LARD: Large-scale Artificial Disfluency Generation
- Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis
- Abstract summary: We propose LARD, a method for generating complex and realistic artificial disfluencies with little effort.
The proposed method can handle three of the most common types of disfluencies: repetitions, replacements and restarts.
We release a new large-scale dataset with disfluencies that can be used for four different tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Disfluency detection is a critical task in real-time dialogue systems.
However, despite its importance, it remains a relatively unexplored field,
mainly due to the lack of appropriate datasets. At the same time, existing
datasets suffer from various issues, including class imbalance, which can
significantly degrade model performance on rare classes, as demonstrated in
this paper. To this end, we propose LARD, a method for
generating complex and realistic artificial disfluencies with little effort.
The proposed method can handle three of the most common types of disfluencies:
repetitions, replacements and restarts. In addition, we release a new
large-scale dataset with disfluencies that can be used for four different tasks:
disfluency detection, classification, extraction and correction. Experimental
results on the LARD dataset demonstrate that the data produced by the proposed
method can be effectively used for detecting and removing disfluencies, while
also addressing limitations of existing datasets.
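To make the three disfluency types concrete, here is a minimal rule-based sketch of how repetitions, replacements and restarts could be injected into fluent text. The function names and heuristics are hypothetical illustrations under simplifying assumptions (whitespace tokenization, a fixed editing phrase), not the released LARD implementation. Because the fluent source sentence is known, the injected spans can double as gold labels for detection, classification, extraction and correction.

```python
# Hypothetical sketch of rule-based disfluency injection, in the spirit of
# LARD; not the authors' released code.

def make_repetition(tokens, i, span=1):
    """Repeat `span` tokens starting at position i:
    'I want a coffee' -> 'I want a a coffee'."""
    return tokens[:i + span] + tokens[i:]

def make_replacement(tokens, i, wrong_word, interregnum=("uh", "I", "mean")):
    """Insert an erroneous word (reparandum) plus an editing phrase
    (interregnum) before the correct word (repair) at position i."""
    return tokens[:i] + [wrong_word] + list(interregnum) + tokens[i:]

def make_restart(tokens, aborted_prefix):
    """Prepend an abandoned utterance fragment before the fresh start."""
    return list(aborted_prefix) + tokens

fluent = "I want a coffee".split()
print(" ".join(make_repetition(fluent, 2)))           # I want a a coffee
print(" ".join(make_replacement(fluent, 3, "tea")))   # I want a tea uh I mean coffee
print(" ".join(make_restart(fluent, ["do", "you"])))  # do you I want a coffee
```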
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Domain-invariant Clinical Representation Learning by Bridging Data Distribution Shift across EMR Datasets [16.317118701435742]
An effective prognostic model is expected to assist doctors in making the right diagnosis and designing a personalized treatment plan.
In the early stage of a disease, limited data collection and clinical experience, along with privacy and ethical concerns, may restrict the data available for reference.
This article introduces a domain-invariant representation learning method to build a transition model from source dataset to target dataset.
arXiv Detail & Related papers (2023-10-11T18:32:21Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play a significant role.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for annotation when an unlabeled sample is believed to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses [0.0]
This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text.
LARD can simulate all the different types of disfluencies (repetitions, replacements and restarts) based on the reparandum/interregnum annotation scheme, illustrated in the sketch below.
Since the proposed method requires only fluent text, it can be used directly for training, bypassing the requirement of annotated disfluent data.
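For reference, the reparandum/interregnum scheme decomposes a disfluency into a reparandum (the material the speaker abandons), an optional interregnum (an editing phrase such as "uh, I mean"), and the repair. A hedged sketch of what one span-level annotation could look like; the field names and index convention are hypothetical, not the dataset's actual schema:

```python
# Hypothetical span annotation for a replacement disfluency; token indices
# are half-open [start, end) over the whitespace-tokenized sentence.
example = {
    "text":        "I want a tea uh I mean coffee",
    "reparandum":  (3, 4),  # "tea" -- the word to be discarded
    "interregnum": (4, 7),  # "uh I mean" -- the editing phrase
    "repair":      (7, 8),  # "coffee" -- the corrected material
    "fluent":      "I want a coffee",
}
```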
arXiv Detail & Related papers (2022-11-16T22:00:02Z)
- Multiple Instance Learning for Detecting Anomalies over Sequential Real-World Datasets [2.427831679672374]
Multiple Instance Learning (MIL) has been shown to be effective on problems with incomplete knowledge of labels in the training dataset.
We propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions.
The framework generalizes well over diverse datasets resulting from different real-world application domains.
arXiv Detail & Related papers (2022-10-04T16:02:09Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with classical results on causal inference yields an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Meta-learning One-class Classifiers with Eigenvalue Solvers for Supervised Anomaly Detection [55.888835686183995]
We propose a neural network-based meta-learning method for supervised anomaly detection.
We experimentally demonstrate that the proposed method achieves better performance than existing anomaly detection and few-shot learning methods.
arXiv Detail & Related papers (2021-03-01T01:43:04Z)
- Out-Of-Bag Anomaly Detection [0.9449650062296822]
Data anomalies are ubiquitous in real world datasets, and can have an adverse impact on machine learning (ML) systems.
We propose a novel model-based anomaly detection method that we call Out-of-Bag anomaly detection.
Via a case study on home valuation, we show that our method, applied as a data pre-processing step, can improve the accuracy and reliability of an ML system.
arXiv Detail & Related papers (2020-09-20T06:01:52Z)
- Meta Learning for Causal Direction [29.00522306460408]
We introduce a novel generative model that allows distinguishing cause and effect in the small data setting.
We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
arXiv Detail & Related papers (2020-07-06T15:12:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.