Related papers: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

URL: http://arxiv.org/abs/2505.24826v1
Date: Fri, 30 May 2025 17:30:18 GMT
Title: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Authors: Li yunhan, Wu gengshen,
Abstract summary: We develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology.<n>Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7%$ noted at 72 billion parameters.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.

Related papers

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization [31.908590128913094]
HeuriGym is an agentic framework designed for evaluating algorithms generated by Large Language Models (LLMs)<n>We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning.<n>Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
arXiv Detail & Related papers (2025-06-09T17:46:47Z)
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.<n>We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization [0.0]
We present a suite of experiments to assess the performance of modern large language models (LLMs)<n>Our experiments demonstrate scaling effects - performance on this task improves with model size.<n>We utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious.
arXiv Detail & Related papers (2025-05-04T16:24:12Z)
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [17.432401371613903]
We propose a resource-efficient, System-2 thinking framework for code correctness evaluation.<n>MCTS-Judge uses Monte Carlo Tree Search to decompose problems into simpler, multi-perspective evaluations.<n>High-precision, unit-test-level reward mechanism encourages the Large Language Model to perform line-by-line analysis.
arXiv Detail & Related papers (2025-02-18T02:55:48Z)
SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals [51.60874286674908]
We aim to predict performance in closed-book question answering (QA), a vital downstream task indicative of a model's internal knowledge.<n>We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models.<n>Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study [9.30538764385435]
Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored.<n>We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations.<n>We then conduct a systematic benchmarking across a range of solutions.<n>Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero.
arXiv Detail & Related papers (2024-12-09T07:46:14Z)
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.<n>We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.<n>We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [34.93870615625937]
PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
arXiv Detail & Related papers (2022-06-21T16:15:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.