Question-Answering Approach to Evaluating Legal Summaries
- URL: http://arxiv.org/abs/2309.15016v2
- Date: Mon, 18 Dec 2023 21:43:01 GMT
- Title: Question-Answering Approach to Evaluating Legal Summaries
- Authors: Huihui Xu and Kevin Ashley
- Abstract summary: GPT-4 is used to generate a set of question-answer pairs that cover the main points and information in the reference summary.
GPT-4 is then used to answer the questions from the reference summary based on the generated summary.
GPT-4 grades the answers from the reference summary and the generated summary.
- Score: 0.43512163406551996
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional evaluation metrics like ROUGE compare lexical overlap between the
reference and generated summaries without taking argumentative structure into
account, which is important for legal summaries. In this paper, we propose a
novel legal summarization evaluation framework that utilizes GPT-4 to generate
a set of question-answer pairs that cover main points and information in the
reference summary. GPT-4 is then used to generate answers based on the
generated summary for the questions from the reference summary. Finally, GPT-4
grades the answers from the reference summary and the generated summary. We
examined the correlation between GPT-4 grading and human grading. The results
suggest that this question-answering approach with GPT-4 can be a useful tool
for gauging the quality of the summary.
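The proposed framework reduces to three GPT-4 calls. Below is a minimal sketch in Python against the OpenAI chat completions API; the prompt wording, the `gpt-4` model string, and the helpers `ask_gpt4` and `evaluate_summary` are illustrative assumptions, not the authors' exact prompts or code.

```python
# Minimal sketch of the three-step QA evaluation pipeline (illustrative;
# the prompts below are assumptions, not the paper's exact wording).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt4(prompt: str) -> str:
    """Single-turn GPT-4 call returning the text of the reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def evaluate_summary(reference: str, generated: str) -> str:
    # Step 1: derive question-answer pairs covering the reference summary.
    qa_pairs = ask_gpt4(
        "Generate question-answer pairs covering the main points of this "
        f"summary:\n{reference}"
    )
    # Step 2: answer the same questions using only the generated summary.
    candidate_answers = ask_gpt4(
        f"Using only this summary:\n{generated}\n\n"
        f"Answer these questions:\n{qa_pairs}"
    )
    # Step 3: grade the candidate answers against the reference answers.
    return ask_gpt4(
        f"Reference QA pairs:\n{qa_pairs}\n\n"
        f"Candidate answers:\n{candidate_answers}\n\n"
        "Grade each candidate answer against its reference answer."
    )
```

The grading step is where the human-correlation analysis in the paper applies: the GPT-4 grades produced in step 3 are the scores compared against human grades.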
Related papers
- GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering [0.0]
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases.
We address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems.
arXiv Detail & Related papers (2024-09-10T15:39:32Z) - Leveraging Lecture Content for Improved Feedback: Explorations with GPT-4 and Retrieval Augmented Generation [0.0]
This paper presents the use of Retrieval Augmented Generation to improve the feedback generated by Large Language Models for programming tasks.
The corresponding lecture recordings were transcribed and made available to the Large Language Model GPT-4 as an external knowledge source.
The purpose of this is to prevent hallucinations and to enforce the use of the technical terms and phrases from the lecture.
arXiv Detail & Related papers (2024-05-05T18:32:06Z) - AugSumm: towards generalizable speech summarization using synthetic
labels from large language model [61.73741195292997]
Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech.
Conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary.
We propose AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries.
arXiv Detail & Related papers (2024-01-10T18:39:46Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks [53.936643052339]
We evaluate the reasoning abilities of text-only and multimodal versions of GPT-4.
Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
arXiv Detail & Related papers (2023-11-14T04:33:49Z) - From Sparse to Dense: GPT-4 Summarization with Chain of Density
Prompting [57.25154420382581]
A good summary should be detailed and entity-centric without being overly dense and hard to follow.
We solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" prompt.
We conduct a human preference study on 100 CNN DailyMail articles and find that humans prefer GPT-4 summaries that are denser than those generated by a vanilla prompt.
arXiv Detail & Related papers (2023-09-08T11:31:08Z) - Argumentative Segmentation Enhancement for Legal Summarization [0.913755431537592]
GPT-3.5 is used to generate summaries based on argumentative segments.
In terms of automatic evaluation metrics, our method generates higher quality argumentative summaries.
arXiv Detail & Related papers (2023-07-11T07:29:18Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Understanding the Extent to which Summarization Evaluation Metrics
Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap; a sketch of such lexical overlap appears after this list.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.