Summarisation of German Judgments in conjunction with a Class-based Evaluation
- URL: http://arxiv.org/abs/2505.05947v1
- Date: Fri, 09 May 2025 10:44:34 GMT
- Title: Summarisation of German Judgments in conjunction with a Class-based Evaluation
- Authors: Bianca Steffes, Nils Torben Wiedemann, Alexander Gratz, Pamela Hochreither, Jana Elina Meyer, Katharina Luise Schilke
- Abstract summary: We create summaries (guiding principles) of German judgments by fine-tuning a decoder-based large language model. We enrich the judgments with information about legal entities before the training. Our results show that employing legal entities helps the generative model to find the relevant content, but the quality of the created summaries is not yet sufficient for use in practice.
- Score: 37.69303106863453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automated summarisation of long legal documents can be a great aid for legal experts in their daily work. We automatically create summaries (guiding principles) of German judgments by fine-tuning a decoder-based large language model. We enrich the judgments with information about legal entities before the training. For the evaluation of the created summaries, we define a set of evaluation classes which allows us to measure their language, pertinence, completeness and correctness. Our results show that employing legal entities helps the generative model to find the relevant content, but the quality of the created summaries is not yet sufficient for use in practice.
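To make the class-based evaluation idea concrete, the following is a minimal Python sketch of how the four dimensions named in the abstract (language, pertinence, completeness, correctness) could be recorded per generated guiding principle. The class labels, ordinal levels, and the aggregation rule are illustrative assumptions, not the scheme defined in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class EvalClass(Enum):
    """Hypothetical ordinal classes for one evaluation dimension."""
    INSUFFICIENT = 0
    PARTIAL = 1
    SUFFICIENT = 2


@dataclass
class SummaryAssessment:
    """One annotator's class-based judgment of a generated guiding principle."""
    language: EvalClass      # grammaticality and legal register
    pertinence: EvalClass    # does the summary address the core of the judgment?
    completeness: EvalClass  # are all relevant points covered?
    correctness: EvalClass   # is the content factually and legally accurate?

    def usable_in_practice(self) -> bool:
        """Toy aggregation rule: every dimension must reach the top class."""
        return all(
            dim is EvalClass.SUFFICIENT
            for dim in (self.language, self.pertinence,
                        self.completeness, self.correctness)
        )


# Example: a summary with good language but incomplete content
assessment = SummaryAssessment(
    language=EvalClass.SUFFICIENT,
    pertinence=EvalClass.SUFFICIENT,
    completeness=EvalClass.PARTIAL,
    correctness=EvalClass.SUFFICIENT,
)
print(assessment.usable_in_practice())  # False
```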
Related papers
- Aligning Language Models for Icelandic Legal Text Summarization [1.5259290787592112]
This study examines whether preference-based training techniques can enhance models' performance in generating Icelandic legal summaries. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage.
arXiv Detail & Related papers (2025-04-25T08:55:15Z) - JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System [12.256518096712334]
JuDGE (Judgment Document Generation Evaluation) is a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents.
arXiv Detail & Related papers (2025-03-18T13:48:18Z) - Automating Legal Concept Interpretation with LLMs: Retrieval, Generation, and Evaluation [27.345475442620746]
Legal articles often include vague concepts in order to adapt to an ever-changing society. Interpreting these concepts requires meticulous and professional annotation and summarization by legal experts. By emulating legal experts' doctrinal method, we introduce a novel framework, ATRIE, which comprises a legal concept interpreter and a legal concept interpretation evaluator.
arXiv Detail & Related papers (2025-01-03T10:11:38Z) - Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - Incremental Extractive Opinion Summarization Using Cover Trees [81.59625423421355]
In online marketplaces, user reviews accumulate over time, and opinion summaries need to be updated periodically.
In this work, we study the task of extractive opinion summarization in an incremental setting.
We present an efficient algorithm for accurately computing the CentroidRank summaries in an incremental setting.
arXiv Detail & Related papers (2024-01-16T02:00:17Z) - Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement [3.537369004801589]
We study the classification of legal reasoning according to jurisprudential philosophy.
We use a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
We find that generative models perform poorly when given instructions equal to the instructions presented to human annotators.
arXiv Detail & Related papers (2023-10-27T19:27:59Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that comprehensively compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - An Evaluation Framework for Legal Document Summarization [1.9709122688953327]
A law practitioner has to go through numerous lengthy legal case proceedings across various categories, such as land disputes and corruption.
It is important to summarize these documents, and ensure that summaries contain phrases with intent matching the category of the case.
We propose an automated intent-based summarization metric, which shows better agreement with human evaluation than other automated metrics like BLEU and ROUGE-L.
arXiv Detail & Related papers (2022-05-17T16:42:03Z) - LexGLUE: A Benchmark Dataset for Legal Language Understanding in English [15.026117429782996]
We introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks.
We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
arXiv Detail & Related papers (2021-10-03T10:50:51Z) - Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.