Resolving the Imbalance Issue in Hierarchical Disciplinary Topic
Inference via LLM-based Data Augmentation
- URL: http://arxiv.org/abs/2310.05318v2
- Date: Sun, 15 Oct 2023 03:34:40 GMT
- Title: Resolving the Imbalance Issue in Hierarchical Disciplinary Topic
Inference via LLM-based Data Augmentation
- Authors: Xunxin Cai, Meng Xiao, Zhiyuan Ning, Yuanchun Zhou
- Abstract summary: This study leverages large language models (Llama V1) as data generators to augment research proposals categorized within intricate disciplinary hierarchies.
Our experiments attest to the efficacy of the generated data, demonstrating that research proposals produced using the prompts can effectively address the aforementioned issues.
- Score: 5.98277339029019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In addressing the issue of imbalanced data within the realm of Natural
Language Processing, text data augmentation methods have emerged as pivotal
solutions. This data imbalance is prevalent in the research proposals submitted
during the funding application process. Such imbalances, resulting from the
varying popularity of disciplines or the emergence of interdisciplinary
studies, significantly impede the precision of downstream topic models that
deduce the affiliated disciplines of these proposals. At the data level,
proposals penned by experts and scientists are inherently complex technological
texts, replete with intricate terminologies, so augmenting such specialized
text data poses unique challenges. At the system level, this imbalance in turn
compromises the fairness of AI-assisted reviewer assignment systems, which
puts a spotlight on solving this issue. This study leverages large language
models (Llama V1) as data generators to augment research proposals categorized
within intricate disciplinary hierarchies, aiming to rectify data imbalances
and enhance the equity of expert assignments. We first sample within the
hierarchical structure to find the under-represented class. Then we design a
prompt for keyword-based research proposal generation. Our experiments attest
to the efficacy of the generated data, demonstrating that research proposals
produced using the prompts can effectively address the aforementioned issues
and constitute high-quality scientific text data, thus helping the model
overcome the imbalance issue.
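The two steps named in the abstract (finding an under-represented class in the hierarchy, then building a keyword-based generation prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sibling-ratio rule, the `ratio` threshold, the toy labels, and the prompt wording are all assumptions made for illustration, and the call to the language model itself is left out.

```python
from collections import Counter, defaultdict


def underrepresented_classes(label_paths, ratio=0.5):
    """Return label paths whose count is below `ratio` times the largest sibling count.

    `label_paths` is a list of tuples such as ("Physics", "Optics").
    The sibling-ratio rule is a hypothetical stand-in for the paper's sampling step.
    """
    counts = Counter(label_paths)
    siblings = defaultdict(list)
    for path in counts:
        siblings[path[:-1]].append(path)  # group classes by their parent path
    rare = []
    for children in siblings.values():
        largest = max(counts[c] for c in children)
        rare.extend(c for c in children if counts[c] < ratio * largest)
    return sorted(rare)


def build_prompt(path, keywords):
    """Format a keyword-based generation prompt (the wording is hypothetical)."""
    discipline = " > ".join(path)
    return (
        f"Write a research proposal abstract in the discipline '{discipline}' "
        f"that covers the following keywords: {', '.join(keywords)}."
    )


# Toy data: one class is much rarer than its sibling under the same parent.
labels = (
    [("Physics", "Optics")] * 8
    + [("Physics", "Acoustics")] * 2
    + [("Chemistry", "Catalysis")] * 5
)
rare = underrepresented_classes(labels)
prompt = build_prompt(rare[0], ["ultrasound", "metamaterials"])
print(rare)
print(prompt)
```

In this sketch the prompt string would then be passed to a text generator such as the Llama model mentioned in the abstract, and the generated proposals added to the rare class before training the topic model.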
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Context Matters: Data-Efficient Augmentation of Large Language Models for Scientific Applications [15.893290942177112]
We explore the challenges inherent to Large Language Models (LLMs) like GPT-4.
The capacity of LLMs to present erroneous answers in a coherent and semantically rigorous manner complicates the detection of factual inaccuracies.
Our work aims to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of LLM accuracy and reliability.
arXiv Detail & Related papers (2023-12-12T08:43:20Z)
- Interdisciplinary Fairness in Imbalanced Research Proposal Topic Inference: A Hierarchical Transformer-based Method with Selective Interpolation [26.30701957043284]
Automated topic inference can reduce human errors caused by manual topic filling, bridge the knowledge gap between funding agencies and project applicants, and improve system efficiency.
Existing methods overlook the gap in scale between interdisciplinary research proposals and non-interdisciplinary ones, leading to an unjust phenomenon.
In this paper, we implement a topic label inference system based on a Transformer encoder-decoder architecture.
arXiv Detail & Related papers (2023-09-04T16:54:49Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Boosting Event Extraction with Denoised Structure-to-Text Augmentation [52.21703002404442]
Event extraction aims to recognize pre-defined event triggers and arguments from texts.
Recent data augmentation methods often neglect the problem of grammatical incorrectness.
We propose a denoised structure-to-text augmentation framework for event extraction (DAEE).
arXiv Detail & Related papers (2023-05-16T16:52:07Z)
- Is augmentation effective to improve prediction in imbalanced text datasets? [3.1690891866882236]
We argue that adjusting the cutoffs without data augmentation can produce similar results to oversampling techniques.
Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data.
arXiv Detail & Related papers (2023-04-20T13:07:31Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- Resilient Neural Forecasting Systems [10.709321760368137]
Industrial machine learning systems face data challenges that are often under-explored in the academic literature.
In this paper, we discuss data challenges and solutions in the context of a Neural Forecasting application on labor planning.
We address changes in data distribution with a periodic retraining scheme and discuss the critical importance of model stability in this setting.
arXiv Detail & Related papers (2022-03-16T09:37:49Z)
- Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer [72.5190560787569]
In computer vision, learning from long tailed datasets is a recurring theme, especially for natural image datasets.
Our proposal posits a meta-distributional scenario, where the data generating mechanism is invariant across the label-conditional feature distributions.
This allows us to leverage a causal data inflation procedure to enlarge the representation of minority classes.
arXiv Detail & Related papers (2020-11-25T00:13:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.