AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments
- URL: http://arxiv.org/abs/2603.04718v1
- Date: Thu, 05 Mar 2026 01:45:28 GMT
- Title: AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments
- Authors: Kylie Zhang, Nimra Nadeem, Lucia Zheng, Dominik Stammbach, Peter Henderson
- Abstract summary: We examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues.
- Score: 7.808898285349819
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
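One of the abstract's proxy metrics is recall of ground-truth substantive legal issues by the simulated questions. As a hypothetical illustration only (the paper's actual matching procedure is not given here, and the exact-match function below is an assumption), issue-level recall can be sketched as:

```python
# Hypothetical sketch of issue-level recall: the fraction of ground-truth
# substantive legal issues that the simulator's questions touched on.
# Exact set membership is an assumed stand-in for whatever matching
# procedure the paper actually uses.

def issue_recall(ground_truth_issues, covered_issues):
    """Return the fraction of ground-truth issues covered by the simulator."""
    if not ground_truth_issues:
        return 1.0  # vacuously perfect recall when nothing is expected
    hits = sum(1 for issue in ground_truth_issues if issue in covered_issues)
    return hits / len(ground_truth_issues)

# Example: the simulator raises 3 of 4 ground-truth issues.
gt = {"standing", "mootness", "statutory interpretation", "remedy"}
covered = {"standing", "mootness", "remedy"}
print(issue_recall(gt, covered))  # → 0.75
```

Note that high recall alone can mask the shortcomings the abstract reports, such as low diversity in question types, which is why the paper pairs it with complementary metrics.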
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z) - Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics [49.3262123849242]
We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces.
arXiv Detail & Related papers (2025-11-30T18:32:43Z) - Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments [2.8622281002418357]
Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments.
arXiv Detail & Related papers (2025-10-29T10:21:25Z) - Do LLMs Truly Understand When a Precedent Is Overruled? [3.5784933879188796]
Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases.
arXiv Detail & Related papers (2025-10-23T19:07:42Z) - Judicial Requirements for Generative AI in Legal Reasoning [0.0]
Large Language Models (LLMs) are being integrated into professional domains, yet their limitations in high-stakes fields like law remain poorly understood. This paper defines the core capabilities that an AI system must possess to function as a reliable reasoning tool in judicial decision-making.
arXiv Detail & Related papers (2025-08-26T09:56:26Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning [17.829990749622496]
Reasoning Court (RC) is a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.
arXiv Detail & Related papers (2025-04-14T00:56:08Z) - AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction [56.797874973414636]
AnnoCaseLaw is a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Our dataset lays the groundwork for more human-aligned, explainable Legal Judgment Prediction models. Results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult.
arXiv Detail & Related papers (2025-02-28T19:14:48Z) - On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge who asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z) - Legal Judgment Prediction with Multi-Stage Case Representation Learning in the Real Court Setting [25.53133777558123]
We introduce a novel dataset from real courtrooms to predict the legal judgment in a reasonably encyclopedic manner.
An extensive set of experiments with a large civil trial dataset shows that the proposed model can more accurately characterize the interactions among claims, facts, and debate for legal judgment prediction.
arXiv Detail & Related papers (2021-07-12T04:27:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.