ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification
- URL: http://arxiv.org/abs/2502.05923v1
- Date: Sun, 09 Feb 2025 14:39:01 GMT
- Title: ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification
- Authors: Yashwanth M., Vaibhav Singh, Ayush Maheshwari, Amrith Krishna, Ganesh Ramakrishnan
- Abstract summary: ARISE is a framework that iteratively induces rules and generates synthetic data for text classification. We induce rules via inductive generalisation of syntactic n-grams, enabling us to capture a complementary source of supervision.
- Score: 27.023332376571677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ARISE, a framework that iteratively induces rules and generates synthetic data for text classification. We combine synthetic data generation and automatic rule induction, via bootstrapping, to iteratively filter the generated rules and data. We induce rules via inductive generalisation of syntactic n-grams, enabling us to capture a complementary source of supervision. These rules alone lead to performance gains in both in-context learning (ICL) and fine-tuning (FT) settings. Similarly, using the augmented data from ARISE alone improves model performance, outperforming configurations that rely on complex methods like contrastive learning. Further, our extensive experiments on various datasets, covering three full-shot, eight few-shot and seven multilingual variant settings, demonstrate that the rules and data we generate lead to performance improvements across these diverse domains and languages.
Related papers
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI.
Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency.
We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence.
Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai [5.670682861458055]
We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai.
We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context.
Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation.
arXiv Detail & Related papers (2024-11-23T07:50:59Z)
- Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks.
This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications.
We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
arXiv Detail & Related papers (2024-04-15T17:49:16Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
- On Adversarial Robustness of Synthetic Code Generation [1.2559148369195197]
This paper showcases the existence of significant dataset bias through different classes of adversarial examples.
We propose several dataset augmentation techniques to reduce bias and showcase their efficacy.
arXiv Detail & Related papers (2021-06-22T09:37:48Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.