Boosting Disfluency Detection with Large Language Model as Disfluency
Generator
- URL: http://arxiv.org/abs/2403.08229v1
- Date: Wed, 13 Mar 2024 04:14:33 GMT
- Title: Boosting Disfluency Detection with Large Language Model as Disfluency
Generator
- Authors: Zhenrong Cheng, Jiayan Guo, Hao Sun, Yan Zhang
- Abstract summary: We propose a lightweight data augmentation approach for disfluency detection.
We leverage a large language model (LLM) to generate disfluent sentences as augmentation data.
We apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences.
- Score: 9.653665778500454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current disfluency detection methods heavily rely on costly and scarce
human-annotated data. To tackle this issue, some approaches employ heuristic or
statistical features to generate disfluent sentences, partially improving
detection performance. However, these sentences often deviate from real-life
scenarios, constraining overall model enhancement. In this study, we propose a
lightweight data augmentation approach for disfluency detection, utilizing the
superior generative and semantic understanding capabilities of a large language
model (LLM) to generate disfluent sentences as augmentation data. We leverage
the LLM to generate diverse and more realistic sentences guided by specific
prompts, without the need for fine-tuning the LLM. Subsequently, we apply an
uncertainty-aware data filtering approach to improve the quality of the
generated sentences, which are then used to train a small detection model for
improved performance. Experiments using the enhanced data yielded state-of-the-art results.
The results showed that using a small amount of LLM-generated enhanced data can
significantly improve performance, thereby further enhancing
cost-effectiveness.
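The abstract does not say how uncertainty is measured or which samples the filter keeps, so the following is only a minimal sketch of one plausible reading: score each LLM-generated sentence by the mean per-token entropy of a detector's label probabilities, and keep the sentences the detector is confident about. The function names, the two-label (fluent, disfluent) layout, the example sentences and probabilities, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical layout: each generated sentence comes with one probability
# distribution per token over the labels (fluent, disfluent), produced by
# some already-trained detector. None of this is specified in the abstract.
Sample = Tuple[str, Sequence[Sequence[float]]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy; higher means the detector is less sure."""
    if not token_probs:
        return float("inf")  # treat empty predictions as maximally uncertain
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def filter_generated(samples: Sequence[Sample], max_uncertainty: float) -> List[str]:
    """Keep sentences whose mean token entropy is at most the threshold."""
    return [
        sentence
        for sentence, token_probs in samples
        if sentence_uncertainty(token_probs) <= max_uncertainty
    ]


# Made-up example data: the first sample has confident predictions,
# the second has near-uniform (uncertain) predictions.
confident = ("to uh", [[0.97, 0.03], [0.02, 0.98]])
uncertain = ("the um", [[0.50, 0.50], [0.55, 0.45]])

kept = filter_generated([confident, uncertain], max_uncertainty=0.3)
print(kept)
```

With this assumed threshold only the first sentence survives. A real pipeline could equally use a different uncertainty estimate or keep a different band of samples; the paper itself would need to be consulted for the actual criterion.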
Related papers
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- CLAIM Your Data: Enhancing Imputation Accuracy with Contextual Large Language Models [0.18416014644193068]
This paper introduces the Contextual Language model for Accurate Imputation Method (CLAIM).
Unlike traditional imputation methods, CLAIM utilizes contextually relevant natural language descriptors to fill missing values.
Our evaluations across diverse datasets and missingness patterns reveal CLAIM's superior performance over existing imputation techniques.
arXiv Detail & Related papers (2024-05-28T00:08:29Z)
- ALMol: Aligned Language-Molecule Translation LLMs through Offline Preference Contrastive Optimisation [2.296475290901356]
We focus on machine language-molecule translation and deploy a novel training approach called contrastive preference optimisation.
Our results demonstrate that our models achieve up to a 32% improvement compared to counterpart models.
arXiv Detail & Related papers (2024-05-14T13:59:24Z)
- Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications [9.982616173090264]
We investigate the usage of large language models (LLMs) for data augmentation as a potential solution to the issue of NLP models making wrong predictions with high confidence during classification tasks.
For mitigation, humans or LLMs provide natural language characterizations of high confidence misclassifications to generate synthetic data, which are then used to extend the training set.
We conduct an extensive evaluation of our approach on three classification tasks and demonstrate its effectiveness in reducing the number of high confidence misclassifications.
arXiv Detail & Related papers (2024-03-26T16:49:25Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Improving Small Language Models on PubMedQA via Generative Data Augmentation [4.96649519549027]
Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing.
Small Language Models (SLMs) are known for their efficiency, but they often struggle with limited capacity and training data.
We introduce a novel method aimed at improving SLMs in the medical domain using LLM-based generative data augmentation.
arXiv Detail & Related papers (2023-05-12T23:49:23Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.