Related papers: Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

URL: http://arxiv.org/abs/2406.09206v1
Date: Thu, 13 Jun 2024 15:06:11 GMT
Title: Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models
Authors: Christopher Schröder, Gerhard Heyer
Abstract summary: We investigate how self-training can be used to improve the efficiency of active learning for text classification. We devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks.
Score: 3.546617486894182
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby making it possible to train a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.
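
The abstract describes combining active learning (asking an annotator for labels) with self-training (letting the model pseudo-label unlabeled data). The sketch below illustrates that general idea only; it is not HAST or any of the reproduced approaches. The function name `split_pool`, the least-confidence query rule, and the fixed confidence threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def split_pool(
    probs: Sequence[Sequence[float]],
    query_size: int,
    threshold: float,
) -> Tuple[List[int], Dict[int, int]]:
    """Split an unlabeled pool into annotator queries and pseudo-labels.

    probs[i] holds the model's predicted class probabilities for pool item i.
    Returns (indices to send to the annotator, {index: pseudo-label}).
    Hypothetical helper: the selection rules here are illustrative assumptions.
    """
    # Confidence of an item = probability of its most likely class.
    confidence = [max(p) for p in probs]

    # Active learning step: query the least confident items.
    by_confidence = sorted(range(len(probs)), key=lambda i: confidence[i])
    queries = by_confidence[:query_size]

    # Self-training step: pseudo-label the remaining items whose
    # confidence reaches the threshold, using the predicted class.
    queried = set(queries)
    pseudo_labels = {
        i: max(range(len(probs[i])), key=lambda c: probs[i][c])
        for i in range(len(probs))
        if i not in queried and confidence[i] >= threshold
    }
    return queries, pseudo_labels


pool_probs = [
    [0.98, 0.02],  # confident, class 0
    [0.55, 0.45],  # uncertain
    [0.10, 0.90],  # confident, class 1
    [0.60, 0.40],  # fairly uncertain
]
queries, pseudo_labels = split_pool(pool_probs, query_size=1, threshold=0.9)
print(queries)        # item sent to the annotator
print(pseudo_labels)  # items labeled by the model itself
```

In a full loop, the annotator's labels and the pseudo-labels would both be added to the training set before the model is retrained and the next round begins.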

Related papers

Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels. By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
Towards Efficient Active Learning in NLP via Pretrained Representations [1.90365714903665]
Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. We drastically expedite this process by using pretrained representations of LLMs within the active learning loop. Our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive.
arXiv Detail & Related papers (2024-02-23T21:28:59Z)
Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems. We focus on achieving language adaptation using minimal labeled and unlabeled data. Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
Iterative Loop Learning Combining Self-Training and Active Learning for Domain Adaptive Semantic Segmentation [1.827510863075184]
Self-training and active learning have been proposed to alleviate this problem. This paper proposes an iterative loop learning method combining Self-Training and Active Learning.
arXiv Detail & Related papers (2023-01-31T01:31:43Z)
Active Self-Training for Weakly Supervised 3D Scene Semantic Segmentation [17.27850877649498]
We introduce a method for weakly supervised segmentation of 3D scenes that combines self-training and active learning. We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous works and baselines.
arXiv Detail & Related papers (2022-09-15T06:00:25Z)
Self-Training: A Survey [5.772546394254112]
Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. We present self-training methods for binary and multi-class classification; as well as their variants and two related approaches.
arXiv Detail & Related papers (2022-02-24T11:40:44Z)
SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data. We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate final feature embeddings.
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning. We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data. Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
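
Several entries above contrast two ways of using automatically labeled data: filtering examples by a confidence threshold, or keeping all of them and assigning importance weights. The sketch below shows the difference on a toy loss. Using the model's confidence as the weight is an assumption made here for simplicity; the co-training entry derives its weights from training dynamics, which this example does not reproduce. The name `pseudo_label_loss` and the 0.8 threshold are hypothetical.

```python
import math
from typing import Sequence


def pseudo_label_loss(
    probs: Sequence[Sequence[float]],
    pseudo_labels: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted mean negative log-likelihood over pseudo-labeled examples."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_nll = sum(
        w * -math.log(p[y]) for p, y, w in zip(probs, pseudo_labels, weights)
    )
    return weighted_nll / total_weight


probs = [[0.9, 0.1], [0.6, 0.4]]
labels = [0, 0]  # pseudo-labels: the predicted class of each example
confidence = [max(p) for p in probs]

# Threshold filtering: examples below the cutoff get weight 0 and are ignored.
hard_weights = [1.0 if c >= 0.8 else 0.0 for c in confidence]
# Importance weighting: every example contributes, scaled by its weight.
soft_weights = confidence

hard_loss = pseudo_label_loss(probs, labels, hard_weights)
soft_loss = pseudo_label_loss(probs, labels, soft_weights)
```

With filtering, the second example never influences training; with weighting it still contributes, only less than the first.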

This list is automatically generated from the titles and abstracts of the papers on this site.