An Efficient Active Learning Pipeline for Legal Text Classification
- URL: http://arxiv.org/abs/2211.08112v1
- Date: Tue, 15 Nov 2022 13:07:02 GMT
- Title: An Efficient Active Learning Pipeline for Legal Text Classification
- Authors: Sepideh Mamooler, Rémi Lebret, Stéphane Massonnet, and Karl Aberer
- Abstract summary: We propose a pipeline for effectively using active learning with pre-trained language models in the legal domain.
We use knowledge distillation to guide the model's embeddings to a semantically meaningful space.
Our experiments on Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show that our approach outperforms standard AL strategies.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active Learning (AL) is a powerful tool for learning with less labeled data,
particularly in specialized domains, such as legal documents, where unlabeled
data is abundant but annotation requires domain expertise and is thus
expensive. Recent works have shown the effectiveness of AL strategies for
pre-trained language models. However, most AL strategies require a set of
labeled samples to start with, which is expensive to acquire. In addition,
pre-trained language models have been shown to be unstable when fine-tuned on
small datasets, and their embeddings are not semantically meaningful. In this
work, we propose a pipeline for effectively using active learning with
pre-trained language models in the legal domain. To this end, we leverage the
available unlabeled data in three phases. First, we continue pre-training the
model to adapt it to the downstream task. Second, we use knowledge distillation
to guide the model's embeddings to a semantically meaningful space. Finally, we
propose a simple yet effective strategy to find the initial set of labeled
samples with fewer actions than existing methods. Our experiments on
Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show
that our approach outperforms standard AL strategies and is more efficient.
Furthermore, our pipeline reaches results comparable to the fully-supervised
approach, with only a small performance gap and a dramatically reduced
annotation cost. Code and the adapted data will be made available.
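To make the three phases concrete, here is a minimal sketch of the pipeline. All specifics are illustrative assumptions rather than the paper's implementation: `bert-base-uncased` as the backbone, a sentence-transformers checkpoint as the distillation teacher, mean pooling for sentence embeddings, and a k-means nearest-to-centroid heuristic standing in for the paper's initial-sample-selection strategy.

```python
# Hedged sketch of the three-phase pipeline; model names, pooling, and the
# cold-start heuristic are illustrative assumptions, not the paper's setup.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import (AutoModel, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

unlabeled = [
    "The Receiving Party shall hold the Confidential Information in confidence.",
    "This Agreement shall be governed by the laws of the State of New York.",
]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Phase 1: continued (domain-adaptive) pre-training with the MLM objective.
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
batch = collator([tok(t, truncation=True) for t in unlabeled])
mlm(**batch).loss.backward()  # one illustrative step; loop over the corpus

def mean_pool(model, tokenizer, texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

# Phase 2: distill toward a semantically meaningful embedding space by
# regressing student sentence embeddings onto a frozen teacher's.
student = AutoModel.from_pretrained("bert-base-uncased")
t_name = "sentence-transformers/all-MiniLM-L6-v2"
teacher = AutoModel.from_pretrained(t_name)
t_tok = AutoTokenizer.from_pretrained(t_name)
with torch.no_grad():
    target = mean_pool(teacher, t_tok, unlabeled)
proj = torch.nn.Linear(768, target.shape[-1])  # bridge the dimension gap
distill_loss = torch.nn.functional.mse_loss(
    proj(mean_pool(student, tok, unlabeled)), target)

# Phase 3 (stand-in): pick the initial seed set by clustering the pool and
# sending the example nearest each centroid to the annotator.
emb = mean_pool(student, tok, unlabeled).detach().numpy()
km = KMeans(n_clusters=2, n_init=10).fit(emb)
seed_ids = [int(np.argmin(((emb - c) ** 2).sum(1))) for c in km.cluster_centers_]
```

In practice the MLM step would loop over the full unlabeled legal corpus, and the seed set chosen in phase 3 would bootstrap a standard AL loop.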
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
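The summary above does not give the exact weighting function, so the following sketch uses an illustrative dataset-cartography-style proxy: the mean probability a classifier assigned to each distantly supervised label across training epochs, used as a soft weight instead of a hard confidence cutoff.

```python
# Hedged sketch: importance weights for noisy (distant) labels derived from
# training dynamics; the mean-confidence proxy is an assumption, not the
# paper's exact formula.
import torch

def importance_weights(probs_per_epoch: torch.Tensor,
                       noisy_labels: torch.Tensor) -> torch.Tensor:
    """probs_per_epoch: [epochs, n, classes] softmax outputs recorded during
    training; noisy_labels: [n] distantly supervised labels."""
    conf = probs_per_epoch[:, torch.arange(noisy_labels.numel()), noisy_labels]
    return conf.mean(dim=0)  # soft weight in [0, 1]; nothing is filtered out

def weighted_ce(logits, noisy_labels, weights):
    # Low-confidence examples still contribute, just with less influence.
    per_ex = torch.nn.functional.cross_entropy(logits, noisy_labels,
                                               reduction="none")
    return (weights * per_ex).mean()
```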
- Towards Efficient Active Learning in NLP via Pretrained Representations [1.90365714903665]
Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications.
We drastically expedite this process by using pretrained representations of LLMs within the active learning loop.
Our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive.
arXiv Detail & Related papers (2024-02-23T21:28:59Z)
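The idea above lends itself to a compact sketch: embed the pool once with the frozen model, then let each AL round refit only a cheap classifier. Margin-based uncertainty sampling is one illustrative acquisition function; the paper may use another.

```python
# Hedged sketch of AL over frozen pretrained representations; the acquisition
# function and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def al_loop(pool_emb, oracle_labels, seed_ids, rounds=5, batch=8):
    labeled = list(seed_ids)  # seed must cover at least two classes
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(pool_emb[labeled],
                                                    oracle_labels[labeled])
        sorted_p = np.sort(clf.predict_proba(pool_emb), axis=1)
        margin = sorted_p[:, -1] - sorted_p[:, -2]  # small = uncertain
        margin[labeled] = np.inf                    # never re-query
        labeled += list(np.argsort(margin)[:batch])
    return clf, labeled
```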
- LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher with only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
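A rough sketch of the LLMaAA-style loop described above, with an LLM in the annotator's seat. `select`, `train`, and `llm_annotate` are hypothetical callables, not APIs from the paper.

```python
# Hedged sketch of an LLM-as-annotator active learning loop; all callables
# are hypothetical stand-ins.
from typing import Callable, List

def llm_in_the_loop(pool: List[str], select: Callable, train: Callable,
                    llm_annotate: Callable, rounds: int = 5, batch: int = 16):
    annotated = {}  # pool index -> LLM-generated (noisy "teacher") label
    model = None
    for _ in range(rounds):
        ids = select(model, pool, exclude=set(annotated), k=batch)
        for i in ids:
            annotated[i] = llm_annotate(pool[i])
        # The small task-specific student is retrained on LLM labels and,
        # per the summary, can eventually outperform its LLM teacher.
        model = train([(pool[i], y) for i, y in annotated.items()])
    return model
```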
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- Active Learning for Abstractive Text Summarization [50.79416783266641]
We propose the first effective query strategy for Active Learning in abstractive text summarization.
We show that using our strategy in AL annotation helps to improve the model performance in terms of ROUGE and consistency scores.
arXiv Detail & Related papers (2023-01-09T10:33:14Z)
- LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds [62.49198183539889]
We propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds.
Our method co-designs an efficient labeling process with semi/weakly supervised learning.
Our method remains highly competitive even with the fully supervised counterpart trained on 100% of the labels.
arXiv Detail & Related papers (2022-10-14T19:13:36Z)
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus at minimum annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves higher performance than state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
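The summary does not describe how dominant sets are turned into annotation queries, but the core primitive the title refers to can be sketched: replicator dynamics on a pairwise-similarity matrix converge to a cluster (a dominant set) with no tunable parameters.

```python
# Hedged sketch of the dominant-set primitive via replicator dynamics; how
# the paper maps clusters to annotation queries is not reconstructed here.
import numpy as np

def dominant_set(sim: np.ndarray, iters: int = 1000, tol: float = 1e-6):
    """sim: symmetric nonnegative similarity matrix with zero diagonal."""
    n = sim.shape[0]
    x = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(iters):
        new = x * (sim @ x)          # replicator dynamics update
        new /= new.sum()
        if np.abs(new - x).sum() < tol:
            break
        x = new
    # The support of the converged vector identifies the cluster members;
    # the 1/n^2 cutoff is a numerical heuristic.
    return np.where(x > 1.0 / n**2)[0]
```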
- ALLWAS: Active Learning on Language models in WASserstein space [13.35098213857704]
In several domains, such as medicine, the scarcity of labeled training data is a common issue.
Active learning may prove helpful in these cases to boost the performance with a limited label budget.
We propose a novel method using sampling techniques based on submodular optimization and optimal transport for active learning in language models.
arXiv Detail & Related papers (2021-09-03T18:11:07Z)
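Of the two ingredients named above, the submodular one is easy to sketch: greedy maximization of a facility-location objective picks a batch of diverse, representative embeddings. The Wasserstein/optimal-transport component of ALLWAS is not reconstructed here.

```python
# Hedged sketch of greedy facility-location batch selection; this covers only
# the submodular ingredient, not ALLWAS's optimal-transport machinery.
import numpy as np

def facility_location_select(emb: np.ndarray, k: int):
    sim = emb @ emb.T                 # pairwise similarity (rows normalized)
    best = np.zeros(len(emb))         # how well each point is covered so far
    selected = []
    for _ in range(k):
        # Marginal gain of adding candidate i: sum_j max(best_j, sim[i, j]).
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf     # no repeats
        i = int(np.argmax(gains))
        selected.append(i)
        best = np.maximum(best, sim[i])
    return selected
```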
- Bayesian Active Learning with Pretrained Language Models [9.161353418331245]
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data.
Previous AL approaches have been limited to task-specific models that are trained from scratch at each iteration.
We introduce BALM: Bayesian Active Learning with pretrained language models.
arXiv Detail & Related papers (2021-04-16T19:07:31Z)
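The summary does not name BALM's acquisition function, so the sketch below uses a common Bayesian choice: Monte-Carlo dropout as the approximate posterior and BALD scores to rank the pool, assuming a Hugging Face-style classification model.

```python
# Hedged sketch of a Bayesian acquisition step with a pretrained LM; BALD
# with MC dropout is an assumed, common choice, not necessarily BALM's.
import torch

@torch.no_grad()
def bald_scores(model, batch, passes: int = 10) -> torch.Tensor:
    model.train()  # keep dropout active at inference time (MC dropout)
    probs = torch.stack([torch.softmax(model(**batch).logits, dim=-1)
                         for _ in range(passes)])        # [T, n, classes]
    mean = probs.mean(0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)
    expected = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
    return entropy - expected  # high = disagreement across dropout samples
```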
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
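The linearization trick is the part worth sketching: tagged sequences are flattened into plain sentences with tags inlined before the tokens they cover, so an ordinary language model can be trained on labeled data and then sample new labeled sentences. The exact scheme below is an illustrative reconstruction.

```python
# Hedged sketch of DAGA-style linearization/de-linearization; the concrete
# scheme (tags inlined before tokens) is an illustrative reconstruction.
def linearize(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)       # tag precedes the token it labels
        out.append(tok)
    return " ".join(out)

def delinearize(text):
    tokens, tags, pending = [], [], "O"
    for piece in text.split():
        if piece.startswith(("B-", "I-")):
            pending = piece       # remember the tag for the next token
        else:
            tokens.append(piece)
            tags.append(pending)
            pending = "O"
    return tokens, tags

# linearize(["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"])
# -> "B-PER John lives in B-LOC Paris"
```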
- Semi-supervised Batch Active Learning via Bilevel Optimization [89.37476066973336]
We formulate our approach as a data summarization problem via bilevel optimization.
We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.
arXiv Detail & Related papers (2020-10-19T16:53:24Z)