CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions
- URL: http://arxiv.org/abs/2501.00097v1
- Date: Mon, 30 Dec 2024 19:00:01 GMT
- Title: CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions
- Authors: Mourad Heddaya, Kyle MacMillan, Anup Malani, Hongyuan Mei, Chenhao Tan
- Abstract summary: This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain.
We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as "syllabuses."
Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815.
- Score: 25.82451110740322
- Abstract: This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain that addresses the need for longer and more complex datasets for summarization evaluation. We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as "syllabuses." Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815. We also present a comprehensive evaluation of LLM-generated summaries using both automatic metrics and expert human evaluation, revealing discrepancies between these assessment methods. Our evaluation shows Mistral 7b, a smaller open-source model, outperforms larger models on most automatic metrics and successfully generates syllabus-like summaries. In contrast, human expert annotators indicate that Mistral summaries contain hallucinations. The annotators consistently rank GPT-4 summaries as clearer and exhibiting greater sensitivity and specificity. Further, we find that LLM-based evaluations are not more correlated with human evaluations than traditional automatic metrics. Furthermore, our analysis identifies specific hallucinations in generated summaries, including precedent citation errors and misrepresentations of case facts. These findings demonstrate the limitations of current automatic evaluation methods for legal summarization and highlight the critical role of human evaluation in assessing summary quality, particularly in complex, high-stakes domains. CaseSumm is available at https://huggingface.co/datasets/ChicagoHAI/CaseSumm
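The dataset link above can be loaded directly with the Hugging Face `datasets` library. Below is a minimal loading sketch; the split and the column names used here (`opinion`, `syllabus`) are assumptions about the schema, so check the returned columns before relying on them.

```python
# Minimal sketch: loading CaseSumm from the Hugging Face Hub.
# The repository path comes from the abstract; the split and the
# column names ("opinion", "syllabus") are assumptions and may
# differ from the actual schema.
from datasets import load_dataset

ds = load_dataset("ChicagoHAI/CaseSumm", split="train")

print(ds.column_names)                     # check the real field names first
example = ds[0]
print(example.get("syllabus", "")[:300])   # assumed field: the official summary
print(example.get("opinion", "")[:300])    # assumed field: the full opinion text
```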
Related papers
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling [16.529070321280447]
This paper introduces the Criticality Prediction dataset, a new resource for evaluating the potential influence of Swiss Supreme Court decisions on future jurisprudence.
Unlike existing approaches that rely on resource-intensive manual annotations, we semi-automatically derive labels leading to a much larger dataset.
We evaluate several multilingual models, including fine-tuned variants and large language models, and find that fine-tuned models consistently outperform zero-shot baselines.
arXiv Detail & Related papers (2024-10-17T11:43:16Z)
- Applicability of Large Language Models and Generative Models for Legal Case Judgement Summarization [5.0645491201288495]
In recent years, generative models, including abstractive summarization models and large language models (LLMs), have gained huge popularity.
In this paper, we explore the applicability of such models for legal case judgement summarization.
arXiv Detail & Related papers (2024-07-06T04:49:40Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
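For orientation, here is a hedged sketch of the general NLI-based idea behind FENICE (not the authors' implementation): each atomic claim extracted from a summary is scored by how strongly an off-the-shelf NLI model judges the source to entail it, and the scores are averaged. The model choice is an assumption, claim extraction is elided, and long documents would need chunking or sentence-level alignment in practice.

```python
# Hedged sketch of NLI-based claim verification in the spirit of FENICE,
# not the authors' implementation. Claims are assumed to be pre-extracted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumed NLI model; any MNLI-style model works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[2].item()  # label order: contradiction, neutral, entailment

def factuality(source: str, claims: list[str]) -> float:
    """Average entailment of each extracted claim against the source text."""
    return sum(entailment_score(source, c) for c in claims) / len(claims)

source = ("The Court held that the statute was unconstitutional and "
          "reversed the judgment of the Court of Appeals.")
claims = [
    "The Court reversed the lower court's judgment.",
    "The case was decided in 1815.",
]
print(f"factuality score: {factuality(source, claims):.2f}")
```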
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization [29.49641083851667]
We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes.
We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences.
arXiv Detail & Related papers (2024-02-20T18:58:49Z)
- Incremental Extractive Opinion Summarization Using Cover Trees [81.59625423421355]
In online marketplaces, user reviews accumulate over time, and opinion summaries need to be updated periodically.
In this work, we study the task of extractive opinion summarization in an incremental setting.
We present an efficient algorithm for accurately computing the CentroidRank summaries in an incremental setting.
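As a point of reference, a batch (non-incremental) version of centroid-style extractive selection can be sketched as follows; TF-IDF vectors are a stand-in for the sentence representation the authors use, and the cover-tree machinery that makes the incremental updates efficient is not shown.

```python
# Batch (non-incremental) sketch of centroid-based extractive selection,
# the analogue of the CentroidRank summaries the paper maintains
# incrementally with cover trees. TF-IDF vectors are a stand-in
# representation for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def centroid_rank(sentences: list[str], k: int = 3) -> list[str]:
    """Return the k sentences closest to the centroid of all sentence vectors."""
    vectors = TfidfVectorizer().fit_transform(sentences)    # (n, vocab), sparse
    centroid = np.asarray(vectors.mean(axis=0))             # (1, vocab), dense
    scores = cosine_similarity(vectors, centroid).ravel()   # (n,)
    top = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in sorted(top)]              # keep original order

reviews = [
    "Battery life is excellent and lasts two days.",
    "The screen is bright but the speakers are weak.",
    "Great battery, I only charge it every other day.",
    "Shipping was slow.",
    "Overall a solid phone with outstanding battery life.",
]
print(centroid_rank(reviews, k=2))
```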
arXiv Detail & Related papers (2024-01-16T02:00:17Z)
- AugSumm: towards generalizable speech summarization using synthetic labels from large language model [61.73741195292997]
Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech.
Conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary.
We propose AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries.
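A hedged sketch of that idea, using a small open model purely as a stand-in for the LLM proxy annotator the paper relies on: paraphrase the single ground-truth summary to obtain additional references.

```python
# Hedged sketch of the AugSumm idea: use a language model as a proxy
# annotator to create extra reference summaries by paraphrasing the
# single human-written ground truth. The paper uses a large LLM;
# flan-t5-base is only a lightweight stand-in for illustration.
from transformers import pipeline

paraphraser = pipeline("text2text-generation", model="google/flan-t5-base")

gt_summary = ("The speaker outlines the launch schedule and the main "
              "risks identified by the engineering team.")
prompt = f"Paraphrase the following summary in different words: {gt_summary}"

augmented = paraphraser(
    prompt, max_new_tokens=64, do_sample=True, temperature=0.9,
    num_return_sequences=2,
)
for out in augmented:
    print(out["generated_text"])
```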
arXiv Detail & Related papers (2024-01-10T18:39:46Z)
- Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement [3.537369004801589]
We study the classification of legal reasoning according to jurisprudential philosophy.
We use a novel dataset of historical United States Supreme Court opinions annotated by a team of domain experts.
We find that generative models perform poorly when given instructions equal to the instructions presented to human annotators.
arXiv Detail & Related papers (2023-10-27T19:27:59Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization? [4.721618284417204]
In recent years, abstractive summarization models have been gaining popularity.
Legal domain-specific pre-trained abstractive summarization models are now available.
General-domain pre-trained Large Language Models (LLMs) are known to generate high-quality text.
arXiv Detail & Related papers (2023-06-02T03:16:19Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
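This is the same kind of metric-versus-human correlation analysis referenced in the CaseSumm abstract above. A toy sketch, with made-up summaries and ratings, of how such a correlation is computed:

```python
# Sketch of the metric-vs-human correlation analysis that SummEval (and
# the CaseSumm evaluation) performs: score each system summary with an
# automatic metric, then correlate those scores with human ratings.
# The summaries and ratings below are invented for illustration.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "The court reversed the judgment and remanded the case."
system_summaries = [
    "The judgment was reversed and the case remanded.",
    "The court affirmed the lower court in full.",
    "The case was sent back after the ruling was overturned.",
]
human_ratings = [5, 1, 4]  # hypothetical expert scores for the same summaries

metric_scores = [
    scorer.score(reference, s)["rougeL"].fmeasure for s in system_summaries
]
rho, p = spearmanr(metric_scores, human_ratings)
print(f"ROUGE-L: {[round(x, 2) for x in metric_scores]}")
print(f"Spearman correlation with human ratings: {rho:.2f} (p={p:.2f})")
```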
arXiv Detail & Related papers (2020-07-24T16:25:19Z)