Exploring Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
- URL: http://arxiv.org/abs/2508.19997v3
- Date: Fri, 29 Aug 2025 04:34:10 GMT
- Title: Exploring Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
- Authors: Boheng Mao
- Abstract summary: This paper explores Selective Retrieval-Augmentation (SRA) as a proof-of-concept approach to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set. SRA achieves consistent gains in both micro-F1 and macro-F1 over LexGLUE baselines.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper explores Selective Retrieval-Augmentation (SRA) as a proof-of-concept approach to this problem. SRA augments only the samples belonging to low-frequency labels in the training set, which avoids introducing noise for well-represented classes and requires no changes to the model architecture. Retrieval is performed only from the training data, which avoids potential information leakage and removes the need for external corpora. SRA is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). Results show that SRA achieves consistent gains in both micro-F1 and macro-F1 over LexGLUE baselines.
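The selective step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state which retriever is used, how "low-frequency" is defined, or how retrieved text is combined with the input. The bag-of-words cosine retriever, the `max_count` threshold, the `top_k` parameter, the `[SEP]` joiner, and the function name `selective_augment` are all assumptions made for illustration, and only the single-label case is shown.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (a stand-in for whatever retriever is really used)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def selective_augment(texts, labels, max_count=1, top_k=1, sep=" [SEP] "):
    """Append retrieved neighbours only to samples whose label is rare.

    A label counts as rare when it occurs at most `max_count` times
    (hypothetical threshold). Neighbours come from the training set itself,
    so no external corpus is involved.
    """
    label_counts = Counter(labels)
    vectors = [bow(t) for t in texts]
    augmented = []
    for i, (text, label) in enumerate(zip(texts, labels)):
        if label_counts[label] > max_count:
            # Well-represented label: leave the sample untouched.
            augmented.append(text)
            continue
        # Rank every other training sample by similarity (ties broken by index).
        ranked = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(len(texts)) if j != i),
            key=lambda pair: (-pair[0], pair[1]),
        )
        neighbours = [texts[j] for score, j in ranked[:top_k] if score > 0]
        augmented.append(sep.join([text] + neighbours))
    return augmented


train_texts = [
    "tenant shall pay rent monthly",
    "tenant shall pay rent quarterly",
    "tenant shall pay a deposit under this agreement",
    "tenant agrees this agreement is governed by state law",
]
train_labels = ["payments", "payments", "payments", "governing_law"]

for line in selective_augment(train_texts, train_labels):
    print(line)
```

In this toy run only the last sample (the single `governing_law` example) is extended with its nearest training neighbour; the three `payments` samples pass through unchanged, which mirrors the idea of not adding noise to well-represented classes.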
Related papers
- Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization [6.696520328216944]
Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones. Recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. We propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification.
arXiv Detail & Related papers (2025-06-03T00:12:39Z)
- From Citations to Criticality: Predicting Legal Decision Influence in the Multilingual Swiss Jurisprudence [16.529070321280447]
We introduce the Criticality Prediction dataset, a novel resource for evaluating case prioritization. Our dataset features a two-tier labeling system: (1) the binary LD-Label, identifying cases published as Leading Decisions (LD), and (2) the more granular Citation-Label, ranking cases by their citation frequency and recency. We evaluate several multilingual models, including both smaller fine-tuned models and large language models in a zero-shot setting.
arXiv Detail & Related papers (2024-10-17T11:43:16Z)
- Class-aware and Augmentation-free Contrastive Learning from Label Proportion [19.41511190742059]
Learning from Label Proportion (LLP) is a weakly supervised learning scenario in which training data is organized into predefined bags of instances.
We propose an augmentation-free contrastive framework TabLLP-BDC that introduces class-aware supervision at the instance level.
Our solution features a two-stage Bag Difference Contrastive (BDC) learning mechanism that establishes robust class-aware instance-level supervision.
arXiv Detail & Related papers (2024-08-13T09:04:47Z)
- PS-TTL: Prototype-based Soft-labels and Test-Time Learning for Few-shot Object Detection [21.443060372419286]
Few-Shot Object Detection (FSOD) has gained widespread attention and made significant progress.
We propose a new framework for FSOD, namely Prototype-based Soft-labels and Test-Time Learning (PS-TTL).
arXiv Detail & Related papers (2024-08-11T02:21:43Z)
- Towards Realistic Long-tailed Semi-supervised Learning in an Open World [0.0]
We construct a more Realistic Open-world Long-tailed Semi-supervised Learning (ROLSSL) setting where there is no premise on the distribution relationships between known and novel categories.
Under the proposed ROLSSL setting, we propose a simple yet potentially effective solution called dual-stage logit adjustments.
Experiments on datasets such as CIFAR100 and ImageNet100 have demonstrated performance improvements of up to 50.1%.
arXiv Detail & Related papers (2024-05-23T12:53:50Z)
- Frequency-Aware Self-Supervised Long-Tailed Learning [36.00672675332761]
We propose Frequency-Aware Self-Supervised Learning (FASSL) for learning from unlabeled data with inherent long-tailed distributions.
We first learn frequency-aware prototypes, reflecting the associated long-tailed distribution. Particularly focusing on rare-class samples, the relationships between image data and the derived prototypes are exploited.
arXiv Detail & Related papers (2023-09-09T08:57:40Z)
- Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z)
- Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels [56.81761908354718]
We propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels.
Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline.
We further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data.
arXiv Detail & Related papers (2023-01-02T07:13:28Z)
- On Non-Random Missing Labels in Semi-Supervised Learning [114.62655062520425]
Semi-Supervised Learning (SSL) is fundamentally a missing label problem.
We explicitly incorporate "class" into SSL.
Our method not only significantly outperforms existing baselines but also surpasses other label bias removal SSL methods.
arXiv Detail & Related papers (2022-06-29T22:01:29Z)
- Cycle Label-Consistent Networks for Unsupervised Domain Adaptation [57.29464116557734]
Domain adaptation aims to leverage a labeled source domain to learn a classifier for the unlabeled target domain with a different distribution.
We propose a simple yet efficient domain adaptation method, i.e. Cycle Label-Consistent Network (CLCN), by exploiting the cycle consistency of classification label.
We demonstrate the effectiveness of our approach on MNIST-USPS-SVHN, Office-31, Office-Home and Image CLEF-DA benchmarks.
arXiv Detail & Related papers (2022-05-27T13:09:08Z)
- Creating Training Sets via Weak Indirect Supervision [66.77795318313372]
Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources.
We formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels.
We develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources.
arXiv Detail & Related papers (2021-10-07T14:09:35Z)
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
- Neighborhood Contrastive Learning for Novel Class Discovery [79.14767688903028]
We build a new framework, named Neighborhood Contrastive Learning, to learn discriminative representations that are important to clustering performance.
We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-06-20T17:34:55Z)
- PLM: Partial Label Masking for Imbalanced Multi-label Classification [59.68444804243782]
Neural networks trained on real-world datasets with long-tailed label distributions are biased towards frequent classes and perform poorly on infrequent classes.
We propose a method, Partial Label Masking (PLM), which utilizes this ratio during training.
Our method achieves strong performance when compared to existing methods on both multi-label (MultiMNIST and MSCOCO) and single-label (imbalanced CIFAR-10 and CIFAR-100) image classification datasets.
arXiv Detail & Related papers (2021-05-22T18:07:56Z)
- Training image classifiers using Semi-Weak Label Data [26.04162590798731]
In Multiple Instance learning (MIL), weak labels are provided at the bag level with only presence/absence information known.
This paper introduces a novel semi-weak label learning paradigm as a middle ground to mitigate the problem.
We propose a two-stage framework to address the problem of learning from semi-weak labels.
arXiv Detail & Related papers (2021-03-19T03:06:07Z)
- Hard Class Rectification for Domain Adaptation [36.58361356407803]
Domain adaptation (DA) aims to transfer knowledge from a label-rich domain (source domain) to a label-scarce domain (target domain).
We propose a novel framework, called Hard Class Rectification Pseudo-labeling (HCRPL), to alleviate the hard class problem.
The proposed method is evaluated in both unsupervised domain adaptation (UDA) and semi-supervised domain adaptation (SSDA).
arXiv Detail & Related papers (2020-08-08T06:21:58Z)
- Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification [64.37745443119942]
This paper jointly enforces visual and temporal consistency in the combination of a local one-hot classification and a global multi-class classification.
Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method in both unsupervised and unsupervised domain adaptive ReID tasks.
arXiv Detail & Related papers (2020-07-21T14:31:27Z)
- NeuCrowd: Neural Sampling Network for Representation Learning with Crowdsourced Labels [19.345894148534335]
We propose NeuCrowd, a unified framework for supervised representation learning (SRL) from crowdsourced labels.
The proposed framework is evaluated on one synthetic and three real-world data sets.
arXiv Detail & Related papers (2020-03-21T13:38:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.