LegalRikai: Open Benchmark - A Benchmark for Complex Japanese Corporate Legal Tasks
- URL: http://arxiv.org/abs/2512.11297v2
- Date: Mon, 15 Dec 2025 11:07:12 GMT
- Title: LegalRikai: Open Benchmark - A Benchmark for Complex Japanese Corporate Legal Tasks
- Authors: Shogo Fujita, Yuji Naraki, Yiqing Zhu, Shinsuke Mori
- Abstract summary: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria.
- Score: 2.399077824457897
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, and assessing structural consistency remains a challenge. The result demonstrates the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.
Related papers
- LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence [3.1504461481102926]
We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. We construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence.
arXiv Detail & Related papers (2026-03-02T09:31:05Z)
- PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z)
- Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts [0.0]
Generative AI models like OpenAI GPT-4.1 Mini, Gemini 2.0 Flash, Meta Llama 3 70B, and DeepSeek R1 could potentially democratize legal assistance. In this study, we collected 250 authentic legal questions from the Facebook group "Know Your Rights". We evaluated each AI-generated response across four critical dimensions: factual accuracy, legal appropriateness, completeness, and clarity.
arXiv Detail & Related papers (2025-11-07T02:44:00Z)
- LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal [34.008574054602356]
The paper describes the structure of the exam, which includes a knowledge test on public procurement law and a written judgment. Several LLMs were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part.
arXiv Detail & Related papers (2025-11-06T09:11:20Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- Automatic Legal Writing Evaluation of LLMs [10.74636407144071]
oab-bench is a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams.
arXiv Detail & Related papers (2025-04-29T22:16:39Z)
- Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
Large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs. We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. Our comprehensive study reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
arXiv Detail & Related papers (2025-03-19T18:09:19Z)
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models [0.0]
The proliferation of large language models (LLMs) requires robust evaluation of their alignment with local values and ethical standards.
LocalValueBench is a benchmark designed to assess LLMs' adherence to Australian values.
arXiv Detail & Related papers (2024-07-27T05:55:42Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.