Related papers: Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

URL: http://arxiv.org/abs/2410.21306v2
Date: Tue, 25 Mar 2025 03:45:48 GMT
Title: Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges
Authors: Farid Ariai, Gianluca Demartini
Abstract summary: This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 133 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts. We provide an overview of NLP tasks specific to legal text, such as Legal Document Summarisation, legal Named Entity Recognition, Legal Question Answering, Legal Argument Mining, Legal Text Classification, and Legal Judgement Prediction.
Score: 4.548047308860141
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Natural Language Processing (NLP) is revolutionising the way legal professionals and laypersons operate in the legal field. The considerable potential for NLP in the legal sector, especially in developing computational tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 133 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and limited open legal datasets. We provide an overview of NLP tasks specific to legal text, such as Legal Document Summarisation, legal Named Entity Recognition, Legal Question Answering, Legal Argument Mining, Legal Text Classification, and Legal Judgement Prediction. In the section on legal Language Models (LMs), we analyse both developed LMs and approaches for adapting general LMs to the legal domain. Additionally, we identify 16 Open Research Challenges, including bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.

Related papers

LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z)
ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation [56.79698529022327]
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. This paper explores the problem of legal claim generation based on the given case's facts. We construct ClaimGen-CN, the first dataset for the Chinese legal claim generation task.
arXiv Detail & Related papers (2025-08-24T07:19:25Z)
VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering [4.546567493379192]
We introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness.
arXiv Detail & Related papers (2025-07-26T16:26:50Z)
LEXam: Benchmarking Legal Reasoning on 340 Law Exams [61.344330783528015]
LEXam is a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament [0.0]
The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It has been demonstrated that even commonly accessible data can be practically utilized for legislative content analysis.
arXiv Detail & Related papers (2025-03-15T12:10:20Z)
LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z)
Legal Evalutions and Challenges of Large Language Models [42.51294752406578]
We use the OPENAI o1 model as a case study to evaluate the performance of large models in applying legal provisions. We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain.
arXiv Detail & Related papers (2024-11-15T12:23:12Z)
LawLLM: Law Large Language Model for the US Legal System [43.13850456765944]
We introduce the Law Large Language Model (LawLLM), a multi-task model specifically designed for the US legal domain. LawLLM excels at Similar Case Retrieval (SCR), Precedent Case Recommendation (PCR), and Legal Judgment Prediction (LJP). We propose customized data preprocessing techniques for each task that transform raw legal data into a trainable format.
arXiv Detail & Related papers (2024-07-27T21:51:30Z)
InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation [5.646219481667151]
This paper introduces a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings. We introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements. We also present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system.
arXiv Detail & Related papers (2024-05-17T11:22:27Z)
Towards A Structured Overview of Use Cases for Natural Language Processing in the Legal Domain: A German Perspective [43.662441393491584]
In recent years, the field of Legal Tech has risen in prevalence, as the Natural Language Processing (NLP) and legal disciplines have combined forces to digitalize legal processes. In this work, we aim to build a structured overview of Legal Tech use cases, grounded in NLP literature, but also supplemented by voices from legal practice in Germany.
arXiv Detail & Related papers (2024-04-29T14:56:47Z)
Exploring the Nexus of Large Language Models and Legal Systems: A Short Survey [1.0770079992809338]
The capabilities of Large Language Models (LLMs) are increasingly demonstrating unique roles in the legal sector. This survey delves into the synergy between LLMs and the legal system, such as their applications in tasks like legal text comprehension, case retrieval, and analysis. The survey showcases the latest advancements in fine-tuned legal LLMs tailored for various legal systems, along with legal datasets available for fine-tuning LLMs in various languages.
arXiv Detail & Related papers (2024-04-01T08:35:56Z)
DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval. We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability. Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z)
Enhancing Pre-Trained Language Models with Sentence Position Embeddings for Rhetorical Roles Recognition in Legal Opinions [0.16385815610837165]
The size of legal opinions continues to grow, making it increasingly challenging to develop a model that can accurately predict the rhetorical roles of legal opinions. We propose a novel model architecture for automatically predicting rhetorical roles using pre-trained language models (PLMs) enhanced with knowledge of sentence position information. Based on an annotated corpus from the LegalEval@SemEval2023 competition, we demonstrate that our approach requires fewer parameters, resulting in lower computational costs.
arXiv Detail & Related papers (2023-10-08T20:33:55Z)
Towards Grammatical Tagging for the Legal Language of Cybersecurity [0.0]
Legal language can be understood as the language typically used by those engaged in the legal profession. Recent legislation on cybersecurity obviously uses legal language in writing. This paper faces the challenge of the essential interpretation of the legal language of cybersecurity.
arXiv Detail & Related papers (2023-06-29T15:39:20Z)
SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system. Most existing language models have difficulty understanding the long-distance dependencies between different structures. We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
A Short Survey of Viewing Large Language Models in Legal Aspect [0.0]
Large language models (LLMs) have transformed many fields, including natural language processing, computer vision, and reinforcement learning. The integration of LLMs into the legal field has also raised several legal problems, including privacy concerns, bias, and explainability.
arXiv Detail & Related papers (2023-03-16T08:01:22Z)
Language Models as Inductive Reasoners [125.99461874008703]
We propose a new paradigm (task) for inductive reasoning, which is to induce natural language rules from natural language facts. We create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. We provide the first and comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts.
arXiv Detail & Related papers (2022-12-21T11:12:14Z)
An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
The Legal Argument Reasoning Task in Civil Procedure [2.079168053329397]
We present a new NLP task and dataset from the domain of the U.S. civil procedure. Each instance of the dataset consists of a general introduction to the case, a particular question, and a possible solution argument.
arXiv Detail & Related papers (2022-11-05T17:41:00Z)
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English [15.026117429782996]
We introduce the Legal General Language Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
arXiv Detail & Related papers (2021-10-03T10:50:51Z)
Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release the Longformer-based pre-trained language model, named as Lawformer, for Chinese legal long documents understanding. We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
On the Ethical Limits of Natural Language Processing on Legal Text [9.147707153504117]
We argue that researchers struggle when it comes to identifying ethical limits to using natural language processing systems. We place emphasis on three crucial normative parameters which have, to the best of our knowledge, been underestimated by current debates. For each of these three parameters we provide specific recommendations for the legal NLP community.
arXiv Detail & Related papers (2021-05-06T15:22:24Z)
A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering [37.66486350122862]
This paper investigates the performance of natural language understanding approaches on statutory reasoning. We introduce a dataset, together with a legal-domain text corpus. We contrast this with a hand-constructed Prolog-based system, designed to fully solve the task.
arXiv Detail & Related papers (2020-05-11T16:54:42Z)
How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence [81.04070052740596]
Legal Artificial Intelligence (LegalAI) focuses on applying the technology of artificial intelligence, especially natural language processing, to benefit tasks in the legal domain. This paper introduces the history, the current state, and the future directions of research in LegalAI.
arXiv Detail & Related papers (2020-04-25T14:45:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.