Automatic Legal Writing Evaluation of LLMs
- URL: http://arxiv.org/abs/2504.21202v1
- Date: Tue, 29 Apr 2025 22:16:39 GMT
- Title: Automatic Legal Writing Evaluation of LLMs
- Authors: Ramon Pires, Roseval Malaquias Junior, Rodrigo Nogueira
- Abstract summary: oab-bench is a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams.
- Score: 10.74636407144071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigated whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.
Related papers
- Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.83088028268318]
This paper introduces the Judge Evaluation for Test-Time Scaling benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures.
arXiv Detail & Related papers (2025-04-21T17:33:23Z)
- Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
The large language model (LLM)-as-judge paradigm has been used to meet the demand for cheap, reliable, and fast evaluation of model outputs.
We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios.
Our comprehensive study reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
arXiv Detail & Related papers (2025-03-19T18:09:19Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Evaluating AI-Generated Essays with GRE Analytical Writing Assessment [15.993966092824335]
This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE).
We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline.
The top-performing models, Gemini and GPT-4o, received average scores of 4.78 and 4.67, respectively.
arXiv Detail & Related papers (2024-10-22T21:30:58Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z)
- Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)