Iterative Data Generation with Large Language Models for Aspect-based Sentiment Analysis
- URL: http://arxiv.org/abs/2407.00341v2
- Date: Mon, 30 Sep 2024 10:33:37 GMT
- Title: Iterative Data Generation with Large Language Models for Aspect-based Sentiment Analysis
- Authors: Qihuang Zhong, Haiyun Li, Luyao Zhuang, Juhua Liu, Bo Du
- Abstract summary: We propose a systematic Iterative Data Generation framework, namely IDG, to boost the performance of ABSA.
The core of IDG is to make full use of the powerful abilities (i.e., instruction-following, in-context learning and self-reflection) of LLMs to iteratively generate more fluent and diverse pseudo-label data.
IDG brings consistent and significant performance gains among five baseline ABSA models.
- Score: 39.57537769578304
- Abstract: Aspect-based Sentiment Analysis (ABSA) is an important sentiment analysis task, which aims to determine the sentiment polarity towards an aspect in a sentence. Due to the expensive and limited labeled data, data generation (DG) has become the standard for improving the performance of ABSA. However, current DG methods usually have some shortcomings: 1) poor fluency and coherence, 2) lack of diversity of generated data, and 3) reliance on some existing labeled data, hindering its applications in real-world scenarios. With the advancement of large language models (LLMs), LLM-based DG has the potential to solve the above issues. Unfortunately, directly prompting LLMs struggles to generate the desired pseudo-label ABSA data, as LLMs are prone to hallucinations, leading to undesired data generation. To this end, we propose a systematic Iterative Data Generation framework, namely IDG, to boost the performance of ABSA. The core of IDG is to make full use of the powerful abilities (i.e., instruction-following, in-context learning and self-reflection) of LLMs to iteratively generate more fluent and diverse pseudo-label data, starting from an unsupervised sentence corpus. Specifically, IDG designs a novel iterative data generation mechanism and a self-reflection data filtering module to tackle the challenges of unexpected data generation caused by hallucinations. Extensive experiments on four widely-used ABSA benchmarks show that IDG brings consistent and significant performance gains among five baseline ABSA models. More encouragingly, the synthetic data generated by IDG can achieve comparable or even better performance against the manually annotated data.
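The abstract describes an iterative generation loop paired with a self-reflection filter but gives no prompts or implementation details. The following is only a minimal sketch of that general shape, not the authors' implementation. All names (`iterative_generate`, `toy_generate`, `toy_reflect`) are hypothetical, the two callables stand in for real LLM calls, and the seed aspects are assumed to have been extracted from the unlabeled corpus already.

```python
from typing import Callable, List, Sequence, Tuple

# A pseudo-labeled ABSA sample: (sentence, aspect, polarity).
Sample = Tuple[str, str, str]


def iterative_generate(
    seed_aspects: Sequence[str],
    generate: Callable[[str, str, List[Sample], int], List[str]],
    reflect: Callable[[Sample], bool],
    rounds: int = 2,
    per_round: int = 2,
) -> List[Sample]:
    """Grow a pool of pseudo-labeled samples over several rounds.

    `generate` and `reflect` are placeholders for LLM calls (assumption):
    one proposes sentences for an (aspect, polarity) pair given in-context
    demonstrations, the other judges whether a sample should be kept.
    """
    pool: List[Sample] = []
    seen = set()
    demos: List[Sample] = []  # in-context examples fed to the next round
    for _ in range(rounds):
        for aspect in seed_aspects:
            for polarity in ("positive", "negative"):
                for sentence in generate(aspect, polarity, demos, per_round):
                    sample = (sentence, aspect, polarity)
                    # Keep only new samples that pass the filter.
                    if sample not in seen and reflect(sample):
                        seen.add(sample)
                        pool.append(sample)
        # Reuse recently accepted samples as demonstrations (assumed policy).
        demos = pool[-3:]
    return pool


def toy_generate(aspect: str, polarity: str, demos: List[Sample], n: int) -> List[str]:
    """Deterministic stand-in for an LLM generator."""
    word = "great" if polarity == "positive" else "awful"
    out = [f"The {aspect} was {word}, take {len(demos)}-{i}." for i in range(n)]
    # One candidate omits the aspect term to mimic an off-target generation.
    out.append(f"It was {word}.")
    return out


def toy_reflect(sample: Sample) -> bool:
    """Stand-in filter: accept only sentences that mention the aspect."""
    sentence, aspect, _ = sample
    return aspect in sentence


data = iterative_generate(["battery", "screen"], toy_generate, toy_reflect)
print(len(data), "samples kept")
```

In a real pipeline both placeholders would wrap prompted model calls, and the filter would check label correctness and fluency rather than a simple substring match.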
Related papers
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z)
- Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation [13.120801609024147]
Retrieval augmented generation (RAG) has been shown to enhance the factuality of large language model (LLM) outputs.
RAG inputs are more complex than most datasets used for training NLI models.
We introduce Automatic Generative Domain Adaptation (Auto-GDA) to enable unsupervised domain adaptation.
arXiv Detail & Related papers (2024-10-04T14:21:27Z)
- UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models [88.16197692794707]
UniGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.
To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature.
Extensive experiments demonstrate the superior quality of data generated by UniGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations.
Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z)
- Improving Pseudo-labelling and Enhancing Robustness for Semi-Supervised Domain Generalization [7.9776163947539755]
We study the problem of Semi-Supervised Domain Generalization which is crucial for real-world applications like automated healthcare.
We propose a new SSDG approach, which utilizes a novel uncertainty-guided pseudo-labelling with model averaging.
Our uncertainty-guided pseudo-labelling (UPL) uses model uncertainty to improve pseudo-labelling selection, addressing poor model calibration under multi-source unlabelled data.
arXiv Detail & Related papers (2024-01-25T05:55:44Z)
- AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications [5.465142671132731]
Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment.
We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications.
We call it AI-assisted Red-Teaming (AART) - an automated alternative to current manual red-teaming efforts.
arXiv Detail & Related papers (2023-11-14T23:28:23Z)
- Targeted Data Generation: Finding and Fixing Model Weaknesses [6.9649605149785465]
Even when aggregate accuracy is high, state-of-the-art NLP models often fail systematically on specific subgroups of data.
We propose Targeted Data Generation (TDG), a framework that automatically identifies challenging subgroups.
In experiments, TDG significantly improves the accuracy on challenging subgroups for state-of-the-art sentiment analysis and natural language inference models.
arXiv Detail & Related papers (2023-05-28T19:36:50Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.