Unsupervised Law Article Mining based on Deep Pre-Trained Language
Representation Models with Application to the Italian Civil Code
- URL: http://arxiv.org/abs/2112.03033v1
- Date: Thu, 2 Dec 2021 11:02:00 GMT
- Title: Unsupervised Law Article Mining based on Deep Pre-Trained Language
Representation Models with Application to the Italian Civil Code
- Authors: Andrea Tagarelli, Andrea Simeri
- Abstract summary: This study proposes an advanced approach to law article prediction for the Italian legal system based on a BERT (Bidirectional Encoder Representations from Transformers) learning framework.
We define LamBERTa models by fine-tuning an Italian pre-trained BERT on the Italian civil code or its portions, for law article retrieval as a classification task.
We provide insights into the explainability and interpretability of our LamBERTa models, and we present an extensive experimental analysis over query sets of different types.
- Score: 3.9342247746757435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling law search and retrieval as prediction problems has recently emerged
as a predominant approach in law intelligence. Focusing on the law article
retrieval task, we present a deep learning framework named LamBERTa, which is
designed for civil-law codes, and specifically trained on the Italian civil
code. To our knowledge, this is the first study proposing an advanced approach
to law article prediction for the Italian legal system based on a BERT
(Bidirectional Encoder Representations from Transformers) learning framework,
which has recently attracted increased attention among deep learning
approaches, showing outstanding effectiveness in several natural language
processing and learning tasks. We define LamBERTa models by fine-tuning an
Italian pre-trained BERT on the Italian civil code or its portions, for law
article retrieval as a classification task. One key aspect of our LamBERTa
framework is that we conceived it to address an extreme classification
scenario, which is characterized by a high number of classes, the few-shot
learning problem, and the lack of test query benchmarks for Italian legal
prediction tasks. To solve such issues, we define different methods for the
unsupervised labeling of the law articles, which can in principle be applied to
any law article code system. We provide insights into the explainability and
interpretability of our LamBERTa models, and we present an extensive
experimental analysis over query sets of different types, for single-label as
well as multi-label evaluation tasks. Empirical evidence has shown the
effectiveness of LamBERTa and its superiority over widely used
deep-learning text classifiers and a few-shot learner conceived for an
attribute-aware prediction task.
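
To make the classification setup concrete, the following is a minimal sketch (not the authors' exact pipeline) of law article retrieval cast as text classification: each civil-code article is a class, training examples are produced by a naive unsupervised labeling scheme that assigns every sentence of an article to that article's id, and an Italian pre-trained BERT is fine-tuned on the result. The HuggingFace transformers/datasets APIs and the dbmdz Italian BERT checkpoint are assumptions, and the sentence-level split merely stands in for the paper's more elaborate labeling methods.

```python
# Minimal sketch: civil-code article retrieval as extreme text classification.
# Assumptions: HuggingFace transformers/datasets, a dbmdz Italian BERT checkpoint,
# and a toy "unsupervised labeling" scheme (every sentence of an article becomes a
# training example labeled with that article's id).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def unsupervised_labeling(articles):
    """articles: dict mapping article id -> article text; returns (text, label) pairs."""
    texts, labels = [], []
    for art_id, body in articles.items():
        for sent in body.split("."):      # naive sentence split, for illustration only
            sent = sent.strip()
            if sent:
                texts.append(sent)
                labels.append(art_id)
    return Dataset.from_dict({"text": texts, "label": labels})

# Toy data: in practice there is one class per article of the civil code.
articles = {
    0: "Testo dell'articolo uno. Altra frase dell'articolo uno.",
    1: "Testo dell'articolo due. Altra frase dell'articolo due.",
}
train_ds = unsupervised_labeling(articles)

checkpoint = "dbmdz/bert-base-italian-xxl-uncased"    # assumed Italian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(articles))             # one class per law article

# Fixed-length padding keeps the default data collator happy.
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lamberta-sketch",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```

At inference, a user query is tokenized the same way and the top-k classes of the softmax output serve as the retrieved articles, which also supports the multi-label evaluation mentioned in the abstract.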
Related papers
- A Multi-Source Heterogeneous Knowledge Injected Prompt Learning Method for Legal Charge Prediction [3.52209555388364]
We propose a method based on a prompt learning framework for modeling case descriptions.
We leverage multi-source external knowledge from a legal knowledge base, a conversational LLM, and legal articles.
Our method achieves state-of-the-art results on CAIL-2018, the largest legal charge prediction dataset.
arXiv Detail & Related papers (2024-08-05T04:53:17Z) - Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z) - LawLLM: Law Large Language Model for the US Legal System [43.13850456765944]
We introduce the Law Large Language Model (LawLLM), a multi-task model specifically designed for the US legal domain.
LawLLM excels at Similar Case Retrieval (SCR), Precedent Case Recommendation (PCR), and Legal Judgment Prediction (LJP).
We propose customized data preprocessing techniques for each task that transform raw legal data into a trainable format.
arXiv Detail & Related papers (2024-07-27T21:51:30Z) - Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - Do Language Models Learn about Legal Entity Types during Pretraining? [4.604003661048267]
We show that Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates.
Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.
arXiv Detail & Related papers (2023-10-19T18:47:21Z) - Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model
Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are previous legal cases with similar facts, which serve as the basis for judging subsequent cases in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z) - SAILER: Structure-aware Pre-trained Language Model for Legal Case
Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z) - AraLegal-BERT: A pretrained language model for Arabic Legal text [0.399013650624183]
We introduce AraLegal-BERT, a bidirectional encoder Transformer-based model that has been thoroughly tested and carefully optimized.
We fine-tuned AraLegal-BERT and evaluated it against three BERT variants for the Arabic language in three natural language understanding (NLU) tasks.
The results show that the base version of AraLegal-BERT achieves better accuracy than the general and original BERT on legal text.
arXiv Detail & Related papers (2022-10-15T13:08:40Z) - Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better
Language Models for Code Understanding [3.98345038769576]
We derive a set of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post.
We evaluate the performance of current state-of-the-art language models on these tasks and show that there is a significant improvement on each task from fine-tuning.
arXiv Detail & Related papers (2021-09-15T17:42:44Z) - Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release a Longformer-based pre-trained language model, named Lawformer, for Chinese legal long-document understanding.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z) - InfoBERT: Improving Robustness of Language Models from An Information
Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.