Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis
- URL: http://arxiv.org/abs/2507.09485v1
- Date: Sun, 13 Jul 2025 04:07:07 GMT
- Title: Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis
- Authors: Junjie Liu, Yuanhe Tian, Yan Song
- Abstract summary: Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios. In this paper, we propose an LLM-based ABSA approach with training data augmentation.
- Score: 21.540505918226348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios that identifies the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA owing to their strong context understanding capabilities, they still struggle to learn contextual information from the running text because the texts are short and the labeled training data are small and unbalanced, with most instances labeled as positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when LLMs are used to create synthetic training data, but it faces challenges in ensuring a high quality of the augmented data. In this paper, we propose an LLM-based ABSA approach with training data augmentation. Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training set with a larger size and a balanced label distribution to better train an ABSA model. Meanwhile, to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation LLM. Experiment results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, which achieves superior performance over strong baselines and most existing studies.
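The augmentation idea described in the abstract, prompting an LLM to synthesize extra examples for the under-represented sentiment labels until the label distribution is balanced, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `llm_generate` callable, the prompt wording, and the dataset layout are assumptions, and the reinforcement-learning step that optimizes the data augmentation LLM is deliberately omitted.

```python
from collections import Counter
import random

def balance_with_llm(train_data, llm_generate, seed=0):
    """Augment an ABSA training set so every sentiment label is equally frequent.

    train_data: list of dicts like {"text": ..., "aspect": ..., "label": ...}
                (hypothetical layout for this sketch).
    llm_generate: callable taking a prompt string and returning generated text,
                  e.g. a thin wrapper around any chat-completion API.
    """
    random.seed(seed)
    counts = Counter(ex["label"] for ex in train_data)
    target = max(counts.values())          # bring every label up to the majority count
    augmented = list(train_data)

    for label, count in counts.items():
        pool = [ex for ex in train_data if ex["label"] == label]
        for _ in range(target - count):
            seed_ex = random.choice(pool)  # condition generation on a real example
            prompt = (
                "Write one new short review sentence that expresses a "
                f"{label} sentiment toward the aspect term '{seed_ex['aspect']}'. "
                f"Use a different wording than: \"{seed_ex['text']}\""
            )
            new_text = llm_generate(prompt).strip()
            augmented.append(
                {"text": new_text, "aspect": seed_ex["aspect"], "label": label}
            )
    random.shuffle(augmented)
    return augmented
```

In the paper, the quality of such generations is further optimized with reinforcement learning on the augmentation LLM; the sketch above only covers the balancing logic and leaves any quality filtering or reward modeling to that separate step.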
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - Data Efficacy for Language Model Training [29.901090317084005]
Data is fundamental to the training of language models (LMs). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. This work introduces a general paradigm, DELT, for considering data efficacy in LM training.
arXiv Detail & Related papers (2025-06-26T17:59:07Z) - Towards Efficient and Effective Alignment of Large Language Models [7.853945494882636]
Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation.
arXiv Detail & Related papers (2025-06-11T02:08:52Z) - Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis [3.1394848827666544]
Large language models (LLMs) are an effective means of addressing data scarcity in low-resource scenarios. We introduce a data augmentation strategy for the aspect category sentiment analysis task. We employ a post-processing technique to ensure semantic consistency between the generated sentence and the original sentence.
arXiv Detail & Related papers (2025-06-08T13:53:28Z) - Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities [3.1394848827666544]
Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation.
arXiv Detail & Related papers (2025-01-31T01:50:49Z) - Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm. EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z) - Iterative Data Generation with Large Language Models for Aspect-based Sentiment Analysis [39.57537769578304]
We propose a systematic Iterative Data Generation framework, namely IDG, to boost the performance of ABSA.
The core of IDG is to make full use of the powerful abilities of LLMs (i.e., instruction following, in-context learning, and self-reflection) to iteratively generate more fluent and diverse pseudo-labeled data.
IDG brings consistent and significant performance gains across five baseline ABSA models.
arXiv Detail & Related papers (2024-06-29T07:00:37Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation [11.778576032848482]
This work enhances scientific language models by improving their inference and evaluation capabilities with minimal or no additional training. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. We propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain.
arXiv Detail & Related papers (2024-05-22T20:40:53Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data augmentation in tExt Regression Tasks [4.102007186133394]
In this paper, we propose a novel weak-labeling strategy for data augmentation in text regression tasks called WADER.
We benchmark the performance of state-of-the-art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in data.
arXiv Detail & Related papers (2023-03-05T19:45:42Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the enlarged dataset tractable, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)