RISE: Leveraging Retrieval Techniques for Summarization Evaluation
- URL: http://arxiv.org/abs/2212.08775v2
- Date: Mon, 22 May 2023 16:53:58 GMT
- Title: RISE: Leveraging Retrieval Techniques for Summarization Evaluation
- Authors: David Uthus and Jianmo Ni
- Abstract summary: We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval.
RISE is first trained as a retrieval task using a dual-encoder retrieval setup and can subsequently be used to evaluate a generated summary given an input document, without gold reference summaries.
We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation.
- Score: 3.9215337270154995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating automatically-generated text summaries is a challenging task.
While there have been many interesting approaches, they still fall short of
human evaluations. We present RISE, a new approach for evaluating summaries by
leveraging techniques from information retrieval. RISE is first trained as a
retrieval task using a dual-encoder retrieval setup and can subsequently be
used to evaluate a generated summary given an input
document, without gold reference summaries. RISE is especially well suited when
working on new datasets where one may not have reference summaries available
for evaluation. We conduct comprehensive experiments on the SummEval benchmark
(Fabbri et al., 2021) and the results show that RISE has higher correlation
with human evaluations compared to many past approaches to summarization
evaluation. Furthermore, RISE also demonstrates data-efficiency and
generalizability across languages.
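As a rough illustration of the reference-free setup the abstract describes, the document and the candidate summary can each be embedded by one tower of a dual encoder and scored by their similarity. The sketch below is a minimal sketch only: it substitutes a toy hashed bag-of-words encoder for RISE's trained neural encoder (the function names and the encoder are assumptions, not the paper's implementation), so only the scoring shape is faithful.

```python
import numpy as np

def embed(text: str, dim: int = 1024) -> np.ndarray:
    """Toy stand-in for one tower of a dual encoder: a hashed,
    L2-normalized bag-of-words vector. RISE itself uses a trained
    neural encoder; this is only illustrative."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def rise_style_score(document: str, summary: str) -> float:
    """Reference-free quality score: similarity between the document
    embedding and the summary embedding (no gold summary needed)."""
    return float(np.dot(embed(document), embed(summary)))

doc = "The city council approved the new transit budget after a long debate."
good = "City council approves transit budget."
bad = "A recipe for chocolate cake with raspberry filling."
# A summary of the document should score higher than unrelated text.
assert rise_style_score(doc, good) > rise_style_score(doc, bad)
```

With a trained encoder, the same scoring interface would apply unchanged; only `embed` would be replaced by the learned model.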
Related papers
- Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods [0.0]
In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies.
A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs.
The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system's accuracy and relevance.
arXiv Detail & Related papers (2024-09-13T02:08:47Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
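One plausible form of that kNN-style integration is to interpolate each document's similarity to the query with its mean similarity to the documents the user marked relevant. The sketch below is a hypothetical reading of that idea (the function name, the interpolation weight `alpha`, and the assumption of L2-normalized vectors are all ours, not the paper's).

```python
import numpy as np

def knn_feedback_rerank(query_vec, doc_vecs, relevant_idx, alpha=0.5):
    """Re-rank documents by blending similarity to the query with mean
    similarity to user-marked relevant documents (a kNN-style score).
    Vectors are assumed L2-normalized; relevant_idx are positions in
    doc_vecs. Returns document indices, best first."""
    docs = np.asarray(doc_vecs)
    query_sim = docs @ np.asarray(query_vec)
    if relevant_idx:
        # Mean similarity of each document to the relevance-feedback set.
        feedback_sim = (docs @ docs[relevant_idx].T).mean(axis=1)
    else:
        feedback_sim = np.zeros(len(docs))
    scores = (1 - alpha) * query_sim + alpha * feedback_sim
    return np.argsort(-scores)
```

Setting `alpha=0` recovers plain query-similarity ranking, while larger values let the feedback documents pull similar documents up the ranking.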
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
- Comparing Methods for Extractive Summarization of Call Centre Dialogue [77.34726150561087]
We experimentally compare several such methods by using them to produce summaries of calls, and evaluating these summaries objectively.
We found that TopicSum and Lead-N outperform the other summarisation methods, whilst BERTSum received comparatively lower scores in both subjective and objective evaluations.
arXiv Detail & Related papers (2022-09-06T13:16:02Z)
- Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
This dataset has two unique aspects: (i) long-input documents based on speech podcasts; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate the summary qualities without reference summaries by unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.