Related papers: LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

URL: http://arxiv.org/abs/2510.07243v1
Date: Wed, 08 Oct 2025 17:10:47 GMT
Title: LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation
Authors: Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu
Abstract summary: This paper introduces a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers. We show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement.
Score: 6.783926395409993
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

Related papers

LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z)
PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z)
Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions [34.91946661563455]
Large language models (LLMs) are being increasingly integrated into legal applications. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice.
arXiv Detail & Related papers (2026-01-21T18:51:37Z)
GLARE: Agentic Reasoning for Legal Judgment Prediction [60.13483016810707]
Legal judgment prediction (LJP) has become increasingly important in the legal field. Existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. We introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules.
arXiv Detail & Related papers (2025-08-22T13:38:12Z)
LLMs for Legal Subsumption in German Employment Contracts [3.3916160303055567]
This study explores the use of Large Language Models and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as "valid," "unfair," or "void" under three legal context variants. Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80%.
arXiv Detail & Related papers (2025-07-02T14:07:54Z)
LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
NitiBench: A Comprehensive Study of LLM Framework Capabilities for Thai Legal Question Answering [6.789538656798745]
This paper introduces NitiBench, a benchmark comprising two datasets: the NitiBench-CCL, covering general Thai financial law, and the NitiBench-Tax, which includes real-world tax law cases. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions.
arXiv Detail & Related papers (2025-02-15T17:52:14Z)
LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z)
Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z)
Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval [16.29803062332164]
We propose a few-shot approach where large language models assist in generating expert-aligned relevance judgments. The proposed approach decomposes the judgment process into several stages, mimicking the workflow of human annotators. It also ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process.
arXiv Detail & Related papers (2024-03-27T09:46:56Z)
A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications. Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks. We design practical baseline solutions based on LLMs and test on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.