Iterative Data Programming for Expanding Text Classification Corpora
- URL: http://arxiv.org/abs/2002.01412v1
- Date: Tue, 4 Feb 2020 17:12:43 GMT
- Title: Iterative Data Programming for Expanding Text Classification Corpora
- Authors: Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta
- Abstract summary: Real-world text classification tasks often require many labeled training examples that are expensive to obtain.
Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data sets quickly.
We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models.
- Score: 9.152045698511506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world text classification tasks often require many labeled training
examples that are expensive to obtain. Recent advancements in machine teaching,
specifically the data programming paradigm, facilitate the creation of training
data sets quickly via a general framework for building weak models, also known
as labeling functions, and denoising them through ensemble learning techniques.
We present a fast, simple data programming method for augmenting text data sets
by generating neighborhood-based weak models with minimal supervision.
Furthermore, our method employs an iterative procedure to identify sparsely
distributed examples from large volumes of unlabeled data. The iterative data
programming technique yields better weak models in each round as more labeled
data is confirmed by a human in the loop. We show empirical results on sentence
classification tasks, including a task of improving intent recognition in
conversational agents.
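As a concrete illustration of the core loop, here is a minimal sketch of a neighborhood-based weak labeler with a human-in-the-loop confirmation step. It assumes precomputed sentence embeddings; the function names, cosine-similarity threshold, and `confirm` callback are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a neighborhood-based weak labeler with iterative,
# human-confirmed expansion; names and thresholds are illustrative.
import numpy as np

def make_neighborhood_lf(seed_vecs, seed_labels, radius=0.8):
    """Labeling function: copy the nearest seed's label to an example
    whose cosine similarity exceeds `radius`; otherwise abstain (-1)."""
    seeds = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    def lf(x):
        sims = seeds @ (x / np.linalg.norm(x))
        best = int(np.argmax(sims))
        return seed_labels[best] if sims[best] >= radius else -1
    return lf

def iterate(unlabeled_vecs, seed_vecs, seed_labels, confirm, rounds=3):
    """Each round: weakly label, have a human confirm, grow the seed set
    so the next round's weak models cover more of the space."""
    for _ in range(rounds):
        lf = make_neighborhood_lf(seed_vecs, seed_labels)
        candidates = [(i, lf(v)) for i, v in enumerate(unlabeled_vecs)]
        candidates = [(i, y) for i, y in candidates if y != -1]
        confirmed = confirm(candidates)  # human-in-the-loop filter
        if not confirmed:
            break
        idx = [i for i, _ in confirmed]
        seed_vecs = np.vstack([seed_vecs, unlabeled_vecs[idx]])
        seed_labels = list(seed_labels) + [y for _, y in confirmed]
    return seed_vecs, seed_labels
```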
Related papers
- Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets [51.74296438621836]
We introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels.
The main limitation of scribbles as a source of weak supervision is the lack of challenging datasets for scribble segmentation.
Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations.
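The summary does not spell out the generation algorithm; one plausible mechanism for deriving a scribble from a dense mask, offered purely as an assumption, is skeletonization:

```python
# Hypothetical scribble generation from a dense annotation mask via
# skeletonization; the paper's actual algorithm may differ.
import numpy as np
from skimage.morphology import skeletonize

def mask_to_scribble(mask: np.ndarray) -> np.ndarray:
    """Reduce a boolean object mask to a thin centerline 'scribble'."""
    return skeletonize(mask.astype(bool))

# Example: a filled square collapses to a sparse skeletal scribble.
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
scribble = mask_to_scribble(mask)
print(scribble.sum(), "scribble pixels from", mask.sum(), "mask pixels")
```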
arXiv Detail & Related papers (2024-08-22T15:29:08Z)
- Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification.
We first obtain easy-to-learn pseudo examples by summarizing the training documents of the target classification task.
We then use these generated pseudo examples to perform curriculum learning.
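A minimal sketch of that idea, assuming a generic Hugging Face summarization model (the paper's actual model choice and pipeline may differ):

```python
# Hypothetical SUMMaug-style augmentation: summarize each training document
# to create an easier pseudo example with the same label, then order the
# training data easy-first (curriculum). Model choice is an assumption.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summaug(docs, labels):
    pseudo = [
        (summarizer(d, max_length=60, min_length=10, do_sample=False)[0]["summary_text"], y)
        for d, y in zip(docs, labels)
    ]
    # Curriculum order: easy summarized examples first, full documents after.
    return pseudo + list(zip(docs, labels))
```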
arXiv Detail & Related papers (2023-12-01T11:34:37Z)
- Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning [114.54944761345594]
We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods.
Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2023-08-10T03:09:12Z)
- Leveraging Key Information Modeling to Improve Less-Data Constrained News Headline Generation via Duality Fine-Tuning [12.443476695459553]
We propose a novel duality fine-tuning method by formally defining the probabilistic duality constraints between key information prediction and headline generation tasks.
The proposed method can capture more information from limited data, build connections between separate tasks, and is suitable for less-data constrained generation tasks.
We conduct extensive experiments demonstrating that our method is effective and efficient, improving performance on both a language modeling metric and an informativeness correctness metric on two public datasets.
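One hedged reading of the duality constraint, as a regularizer tying the two conditional likelihoods together through the shared joint probability (the paper's exact objective and weighting may differ):

```python
# Hypothetical duality regularizer: the joint factorizations
# p(x)p(y|x) and p(y)p(x|y) must agree, so penalize their log-space gap
# alongside the two task likelihoods.
import torch

def duality_loss(log_p_y_given_x, log_p_x_given_y, log_p_x, log_p_y, lam=1.0):
    task_loss = -(log_p_y_given_x.mean() + log_p_x_given_y.mean())
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return task_loss + lam * gap.pow(2).mean()
```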
arXiv Detail & Related papers (2022-10-10T07:59:36Z)
- An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code using only off-the-shelf operations.
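In that spirit, a hedged sketch of negative pseudo-labeling: instead of asserting which class an unlabeled example belongs to, assert the classes it is least likely to belong to and penalize probability mass placed there. The choice of k and the loss form are our assumptions.

```python
# Hypothetical negative pseudo-labeling loss: treat each unlabeled
# example's k least-probable classes as "not this class" and minimize
# the probability assigned to them. k is illustrative.
import torch
import torch.nn.functional as F

def negative_pseudo_label_loss(logits, k=3):
    probs = F.softmax(logits, dim=-1)
    neg_idx = probs.topk(k, dim=-1, largest=False).indices  # least likely classes
    neg_probs = probs.gather(-1, neg_idx)
    return -torch.log(1.0 - neg_probs + 1e-8).mean()
```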
arXiv Detail & Related papers (2022-09-28T02:11:34Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
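A simplified stand-in for curriculum-based self-training, illustrated with a scikit-learn classifier for brevity (the paper targets data-to-text generation, and its difficulty measure differs from the confidence proxy used here):

```python
# Hypothetical curriculum self-training: pseudo-label unlabeled data, rank
# by a difficulty proxy (model confidence), and widen the training pool
# from easiest to hardest over several stages.
import numpy as np
from sklearn.linear_model import LogisticRegression

def curriculum_self_train(X_lab, y_lab, X_unlab, stages=3):
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for stage in range(1, stages + 1):
        conf = model.predict_proba(X_unlab).max(axis=1)    # high = easy
        order = np.argsort(-conf)                          # easiest first
        take = order[: len(order) * stage // stages]       # widen pool each stage
        X_aug = np.vstack([X_lab, X_unlab[take]])
        y_aug = np.concatenate([y_lab, model.predict(X_unlab[take])])
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    return model
```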
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
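One classic baseline in this family, retag-and-compare via cross-validation, can be sketched as follows; the confidence threshold is illustrative, and this is only one of the many surveyed methods:

```python
# Hypothetical retag-and-compare baseline: cross-validated predictions that
# confidently disagree with the gold label are flagged as possible errors.
# Assumes integer labels 0..K-1; threshold is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspicious(X, y, threshold=0.9):
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    pred, conf = probs.argmax(axis=1), probs.max(axis=1)
    return np.where((pred != y) & (conf >= threshold))[0]
```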
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method that trains language models on linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
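The linearization step can be illustrated roughly as below; the tag-placement convention shown is our rendering of the general scheme:

```python
# Hypothetical linearization of a labeled NER sentence for DAGA-style
# augmentation: inline each non-O tag before its token, so a language
# model trained on these sequences can sample new labeled sentences.
def linearize(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":            # inline non-O tags before the token
            out.append(tag)
        out.append(tok)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# -> "B-PER John lives in B-LOC Paris"
```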
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Meta-Learning for Neural Relation Classification with Distant Supervision [38.755055486296435]
We propose a meta-learning based approach, which learns to reweight noisy training data under the guidance of reference data.
Experiments on several datasets demonstrate that the reference data can effectively guide the selection of training data.
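A heavily simplified stand-in for the idea, replacing the paper's meta-learning procedure with a one-shot reference-model weighting (named and structured by us, not the authors):

```python
# Hypothetical simplification of reference-guided reweighting: train on the
# small clean reference set, then weight each noisy example by the reference
# model's probability of its label. Assumes integer labels 0..K-1; the
# actual method learns the weights via meta-gradients.
import numpy as np
from sklearn.linear_model import LogisticRegression

def reference_weights(X_ref, y_ref, X_noisy, y_noisy):
    ref_model = LogisticRegression(max_iter=1000).fit(X_ref, y_ref)
    probs = ref_model.predict_proba(X_noisy)
    return probs[np.arange(len(y_noisy)), y_noisy]

# Usage: pass the result as sample_weight when fitting the main model.
```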
arXiv Detail & Related papers (2020-10-26T12:52:28Z)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations [4.36561468436181]
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
Our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
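The core contrastive objective behind encoders like this can be sketched as an InfoNCE loss over anchor/positive span embeddings; the temperature and batch construction here are illustrative, not the released implementation:

```python
# Hypothetical InfoNCE loss over embeddings of anchor spans and positive
# spans drawn from the same document; other batch items act as negatives.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature          # pairwise similarities
    targets = torch.arange(len(a))            # i-th positive matches i-th anchor
    return F.cross_entropy(logits, targets)   # in-batch negatives
```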
arXiv Detail & Related papers (2020-06-05T20:00:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.