Large Scale Legal Text Classification Using Transformer Models
- URL: http://arxiv.org/abs/2010.12871v1
- Date: Sat, 24 Oct 2020 11:03:01 GMT
- Title: Large Scale Legal Text Classification Using Transformer Models
- Authors: Zein Shaheen, Gerhard Wohlgenannt, Erwin Filtz
- Abstract summary: We study the performance of transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates.
We quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multi-label text classification is a challenging Natural Language
Processing (NLP) problem that is concerned with text classification for
datasets with thousands of labels. We tackle this problem in the legal domain,
where datasets such as JRC-Acquis and EURLEX57K, labeled with the EuroVoc
vocabulary, were created within the legal information systems of the European
Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we
study the performance of various recent transformer-based models in combination
with strategies such as generative pretraining, gradual unfreezing and
discriminative learning rates in order to reach competitive classification
performance, and present new state-of-the-art results of 0.661 (F1) for
JRC-Acquis and 0.754 for EURLEX57K. Furthermore, we quantify the impact of
individual steps, such as language model fine-tuning or gradual unfreezing, in
an ablation study, and provide reference dataset splits created with an
iterative stratification algorithm.
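As a concrete illustration of the strategies named above, the sketch below combines discriminative (layer-wise) learning rates with gradual unfreezing for a BERT-style multi-label classifier. The checkpoint, layer-decay factor, and epoch schedule are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: discriminative learning rates + gradual unfreezing for
# multi-label classification with a BERT-style encoder. The checkpoint and
# hyperparameters below are assumptions for illustration only.
import torch
from transformers import AutoModelForSequenceClassification

NUM_LABELS = 7000  # the EuroVoc taxonomy includes around 7000 concepts
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid + BCE loss
)

# Discriminative learning rates: layers closer to the input get smaller LRs.
base_lr, decay = 2e-5, 0.95
layers = list(model.bert.encoder.layer)
param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for depth, layer in enumerate(reversed(layers)):  # top layer first
    param_groups.append(
        {"params": layer.parameters(), "lr": base_lr * decay ** (depth + 1)}
    )
optimizer = torch.optim.AdamW(param_groups)

# Gradual unfreezing: start with only the classifier trainable, then
# unfreeze one more encoder layer (top-down) at the start of each epoch.
for p in model.bert.parameters():
    p.requires_grad = False

def unfreeze_top_layers(n: int) -> None:
    for layer in layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

for epoch in range(1, 4):
    unfreeze_top_layers(epoch)
    # ... run one training epoch with `optimizer` here ...
```

For the reference dataset splits, the iterative stratification mentioned in the abstract belongs to the family of multi-label splitters implemented, for example, by scikit-multilearn's IterativeStratification; the authoritative splits, however, are the ones the authors provide.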
Related papers
- Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs [1.6575279044457722]
This paper proposes a Clustering, Labeling, then Augmenting framework that enhances performance in Semi-Supervised Text Classification tasks.
Unlike traditional SSTC approaches, this framework employs clustering to select representative "landmarks" for labeling.
Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset.
arXiv Detail & Related papers (2024-11-09T13:17:39Z)
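The "cluster, then label landmarks" step above can be pictured with a minimal sketch, assuming documents are already embedded as vectors: cluster the unlabeled corpus and send only the document nearest each centroid to an annotator. The cluster count and helper names are illustrative, not the paper's configuration.

```python
# Minimal sketch of landmark selection for labeling: cluster document
# embeddings and pick the document nearest each centroid as a "landmark".
# Embeddings and the cluster count are assumed inputs, not the paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_landmarks(embeddings: np.ndarray, n_clusters: int = 100):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(embeddings)
    # Index of the document closest to each cluster centroid.
    landmark_idx = pairwise_distances_argmin(km.cluster_centers_, embeddings)
    return landmark_idx, cluster_ids

# landmark_idx, cluster_ids = select_landmarks(doc_embeddings)
# Human labels for the landmarks can then seed labels for their clusters.
```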
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
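Of the two routes the framework above covers, the encoding-based one is easy to picture: use the LLM only to produce text representations and train a separate head on them, instead of prompting the model to generate a label. The checkpoint and mean pooling below are assumptions for illustration.

```python
# Minimal sketch of encoding-based classification with an LLM: mean-pool
# the model's hidden states into a fixed vector per text, then train any
# classifier head on those features. Checkpoint/pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
encoder = AutoModel.from_pretrained("gpt2")

@torch.no_grad()
def encode(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore pad positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens

# Features from encode() can feed e.g. sklearn's LogisticRegression;
# the generation-based route would instead prompt the LLM for a label.
```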
- A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets [0.0]
This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data.
We use records of demands submitted to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subject categories.
The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and semi-supervised learning strategies.
arXiv Detail & Related papers (2024-09-09T18:10:05Z)
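As a rough sketch of how a UDA-style objective is typically assembled (not necessarily this paper's exact implementation): a supervised loss on the small labeled set plus a consistency term that ties predictions on an unlabeled text to predictions on its augmented version (e.g., a back-translation).

```python
# Minimal sketch of a UDA-style loss: supervised cross-entropy plus a KL
# consistency term between an unlabeled text and its augmentation.
# `model` maps input batches to logits; all names are placeholders.
import torch
import torch.nn.functional as F

def uda_loss(model, labeled_x, labels, unlabeled_x, augmented_x, lam=1.0):
    supervised = F.cross_entropy(model(labeled_x), labels)
    with torch.no_grad():  # the clean prediction acts as a fixed target
        target = F.softmax(model(unlabeled_x), dim=-1)
    log_pred = F.log_softmax(model(augmented_x), dim=-1)
    consistency = F.kl_div(log_pred, target, reduction="batchmean")
    return supervised + lam * consistency
```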
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns importance weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
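A minimal sketch of the weighting idea above, assuming per-example confidences in the (noisy) distantly supervised label are recorded each epoch and averaged into importance weights, in the spirit of training-dynamics methods; this is illustrative, not the authors' code.

```python
# Minimal sketch: weight distantly supervised examples by training dynamics
# (average confidence in the noisy label across epochs) instead of filtering
# at an arbitrary confidence threshold. Names are placeholders.
import torch
import torch.nn.functional as F

confidence_history = []  # one tensor of per-example confidences per epoch

def record_epoch(logits, noisy_labels):
    probs = F.softmax(logits, dim=-1)
    confidence_history.append(
        probs.gather(1, noisy_labels.unsqueeze(1)).squeeze(1)
    )

def weighted_loss(logits, noisy_labels):
    weights = torch.stack(confidence_history).mean(0)  # training dynamics
    per_example = F.cross_entropy(logits, noisy_labels, reduction="none")
    return (weights * per_example).mean()
```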
- IDoFew: Intermediate Training Using Dual-Clustering in Language Models for Few Labels Text Classification [24.11420537250414]
Bidirectional Encoder Representations from Transformers (BERT) has been very effective in various Natural Language Processing (NLP) and text mining tasks, including text classification.
Some tasks still pose challenges for these models, including text classification with limited labels.
We develop a novel two-stage approach, intermediate clustering followed by fine-tuning, that models the pseudo-labels reliably.
arXiv Detail & Related papers (2024-01-08T17:07:37Z)
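A simplified, single-clustering sketch of the intermediate stage above (the paper's dual-clustering scheme refines these pseudo-labels further): cluster assignments over unlabeled text act as pseudo-labels for an intermediate fine-tuning task before the final few-label fine-tuning.

```python
# Minimal sketch of clustering-as-intermediate-task: cluster IDs over
# unlabeled texts become pseudo-labels for a warm-up fine-tuning stage.
# `fine_tune` and the embeddings are assumed helpers, not the paper's code.
from sklearn.cluster import KMeans

def make_pseudo_labels(embeddings, n_clusters=50):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embeddings)  # cluster IDs as pseudo-labels

# pseudo = make_pseudo_labels(unlabeled_embeddings)
# fine_tune(model, unlabeled_texts, pseudo)       # stage 1: intermediate task
# fine_tune(model, labeled_texts, gold_labels)    # stage 2: few real labels
```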
- Transductive Learning for Textual Few-Shot Classification in API-based Embedding Models [46.79078308022975]
Few-shot classification involves training a model to perform a new classification task with a handful of labeled data.
We introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints.
We propose transductive inference, a learning paradigm that has been overlooked by the NLP community.
arXiv Detail & Related papers (2023-10-21T12:47:10Z)
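One common way to realize transductive inference over frozen, API-served embeddings is soft prototype refinement over the whole query set; the sketch below is an illustrative variant of that idea, not necessarily the authors' exact method.

```python
# Minimal sketch of transductive few-shot inference over fixed embeddings:
# class prototypes from the labeled shots are iteratively refined using
# soft assignments of the *entire* unlabeled query set.
import numpy as np

def transductive_predict(support, support_y, query, n_classes,
                         steps=10, temp=10.0):
    protos = np.stack([support[support_y == c].mean(0)
                       for c in range(n_classes)])
    for _ in range(steps):
        d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        w = np.exp(-temp * d)
        w /= w.sum(1, keepdims=True)  # soft class assignments per query
        for c in range(n_classes):    # re-estimate prototypes
            num = support[support_y == c].sum(0) + (w[:, c:c + 1] * query).sum(0)
            den = (support_y == c).sum() + w[:, c].sum()
            protos[c] = num / den
    return w.argmax(1)
```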
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning (TL) method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Novel Class Discovery in Semantic Segmentation [104.30729847367104]
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS).
It aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes.
In NCDSS, we need to distinguish objects from the background, and to handle the existence of multiple classes within an image.
We propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels.
arXiv Detail & Related papers (2021-12-03T13:31:59Z)
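The entropy-based uncertainty modeling step above can be pictured with a small sketch: rank pseudo-labels by prediction entropy, trust the confident portion, and route the rest to self-training. The keep ratio is an assumed hyperparameter, not the paper's value.

```python
# Minimal sketch of an entropy-based split of pseudo-labels into a "clean"
# low-entropy part and a "noisy" high-entropy part for self-training.
import torch

def split_by_entropy(probs: torch.Tensor, keep_ratio: float = 0.5):
    # probs: (N, C) predicted class distributions for pseudo-labeled items
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    order = entropy.argsort()
    k = int(keep_ratio * len(order))
    clean_idx = order[:k]   # most confident pseudo-labels: train directly
    noisy_idx = order[k:]   # uncertain ones: candidates for self-training
    return clean_idx, noisy_idx
```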
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks, including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
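A minimal sketch of calibration with virtual representations, assuming per-class Gaussians fitted over feature vectors (approximating the Gaussian mixture model mentioned above): sample virtual features for each class and retrain only the classifier head on them.

```python
# Minimal sketch: fit a Gaussian per class over feature vectors, sample
# "virtual" representations, and use them to recalibrate the classifier
# head. Shapes and the regularizer are illustrative assumptions.
import torch

def sample_virtual_features(features, labels, n_per_class=100):
    xs, ys = [], []
    for c in labels.unique():
        fc = features[labels == c]            # (N_c, D) features of class c
        mean, cov = fc.mean(0), torch.cov(fc.T)
        dist = torch.distributions.MultivariateNormal(
            mean, cov + 1e-4 * torch.eye(fc.shape[1])  # keep cov well-conditioned
        )
        xs.append(dist.sample((n_per_class,)))
        ys.append(torch.full((n_per_class,), int(c)))
    return torch.cat(xs), torch.cat(ys)

# x_virt, y_virt = sample_virtual_features(feats, ys)
# ...then retrain only the linear classifier head on (x_virt, y_virt)...
```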