A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets
- URL: http://arxiv.org/abs/2409.05972v1
- Date: Mon, 9 Sep 2024 18:10:05 GMT
- Title: A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets
- Authors: Mariana Yukari Noguti, Edduardo Vellasques, Luiz Eduardo Soares Oliveira,
- Abstract summary: This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data.
We use the records of demands to a Brazilian Public Prosecutor's Office aiming to assign the descriptions in one of the subjects.
The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and strategies of semi-supervised learning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in language modelling has significantly decreased the need of labelled data in text classification tasks. Transformer-based models, pre-trained on unlabeled data, can outmatch the performance of models trained from scratch for each task. However, the amount of labelled data need to fine-tune such type of model is still considerably high for domains requiring expert-level annotators, like the legal domain. This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data and perform a classification task in the legal area with 50 predefined topics. More specifically, we use the records of demands to a Brazilian Public Prosecutor's Office aiming to assign the descriptions in one of the subjects, which currently demands deep legal knowledge for manual filling. The task of optimizing the performance of classifiers in this scenario is especially challenging, given the low amount of resources available regarding the Portuguese language, especially in the legal domain. Our results demonstrate that classic supervised models such as logistic regression and SVM and the ensembles random forest and gradient boosting achieve better performance along with embeddings extracted with word2vec when compared to BERT language model. The latter demonstrates superior performance in association with the architecture of the model itself as a classifier, having surpassed all previous models in that regard. The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and strategies of semi-supervised learning, with an accuracy of 80.7% in the aforementioned task.
Related papers
- TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text [5.523385345486362]
We have developed language models specifically designed for legal applications.
Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text.
arXiv Detail & Related papers (2024-10-28T19:32:18Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Language Models are Graph Learners [70.14063765424012]
Language Models (LMs) are challenging the dominance of domain-specific models, including Graph Neural Networks (GNNs) and Graph Transformers (GTs)
We propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art GNNs on node classification tasks.
arXiv Detail & Related papers (2024-10-03T08:27:54Z) - Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement [3.537369004801589]
We study the classification of legal reasoning according to jurisprudential philosophy.
We use a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
We find that generative models perform poorly when given instructions equal to the instructions presented to human annotators.
arXiv Detail & Related papers (2023-10-27T19:27:59Z) - Attention is Not Always What You Need: Towards Efficient Classification
of Domain-Specific Text [1.1508304497344637]
For large-scale IT corpora with hundreds of classes organized in a hierarchy, the task of accurate classification of classes at the higher level in the hierarchies is crucial.
In the business world, an efficient and explainable ML model is preferred over an expensive black-box model, especially if the performance increase is marginal.
Despite the widespread use of PLMs, there is a lack of a clear and well-justified need to as why these models are being employed for domain-specific text classification.
arXiv Detail & Related papers (2023-03-31T03:17:23Z) - An Efficient Active Learning Pipeline for Legal Text Classification [2.462514989381979]
We propose a pipeline for effectively using active learning with pre-trained language models in the legal domain.
We use knowledge distillation to guide the model's embeddings to a semantically meaningful space.
Our experiments on Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show that our approach outperforms standard AL strategies.
arXiv Detail & Related papers (2022-11-15T13:07:02Z) - Teacher Guided Training: An Efficient Framework for Knowledge Transfer [86.6784627427194]
We propose the teacher-guided training (TGT) framework for training a high-quality compact model.
TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain.
We find that TGT can improve accuracy on several image classification benchmarks and a range of text classification and retrieval tasks.
arXiv Detail & Related papers (2022-08-14T10:33:58Z) - Guiding Generative Language Models for Data Augmentation in Few-Shot
Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 in a handful of label instances leads to consistent classification improvements.
arXiv Detail & Related papers (2021-11-17T12:10:03Z) - LegaLMFiT: Efficient Short Legal Text Classification with LSTM Language
Model Pre-Training [0.0]
Large Transformer-based language models such as BERT have led to broad performance improvements on many NLP tasks.
In legal NLP, BERT-based models have led to new state-of-the-art results on multiple tasks.
We show that lightweight LSTM-based Language Models are able to capture enough information from a small legal text pretraining corpus and achieve excellent performance on short legal text classification tasks.
arXiv Detail & Related papers (2021-09-02T14:45:04Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve 7% Micro F1-score upon current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z) - Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.