Summary Workbench: Unifying Application and Evaluation of Text
Summarization Models
- URL: http://arxiv.org/abs/2210.09587v1
- Date: Tue, 18 Oct 2022 04:47:25 GMT
- Title: Summary Workbench: Unifying Application and Evaluation of Text
Summarization Models
- Authors: Shahbaz Syed, Dominik Schwabe, Martin Potthast
- Abstract summary: New models and evaluation measures can be easily integrated as Docker-based plugins.
Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses.
- Score: 24.40171915438056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Summary Workbench, a new tool for developing and
evaluating text summarization models. New models and evaluation measures can be
easily integrated as Docker-based plugins, allowing users to examine the quality of
their summaries against any input and to evaluate them using various evaluation
measures. Visual analyses combining multiple measures provide insights into the
models' strengths and weaknesses. The tool is hosted at
https://tldr.demo.webis.de and also supports local deployment for private
resources.
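The abstract describes models and evaluation measures as Docker-based plugins but does not spell out the plugin interface here. As a purely illustrative sketch (the route name, JSON fields, and lead-3 baseline are assumptions, not Summary Workbench's actual API), a summarizer plugin could be a small containerized HTTP service:

```python
# Hypothetical summarizer plugin: a tiny HTTP service that could be packaged
# into a Docker image. The route and JSON fields below are illustrative
# assumptions, not the interface defined by Summary Workbench.
from flask import Flask, jsonify, request

app = Flask(__name__)

def lead_3(text: str) -> str:
    """Trivial baseline: return the first three sentences as the summary."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    return ". ".join(sentences[:3])

@app.route("/summarize", methods=["POST"])
def summarize():
    payload = request.get_json(force=True)
    documents = payload.get("documents", [])
    return jsonify({"summaries": [lead_3(doc) for doc in documents]})

if __name__ == "__main__":
    # Inside a container, this port would be exposed to the host tool.
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile wrapping such a script and its dependencies would make the model pluggable into the evaluation front end; evaluation-measure plugins could follow the same pattern with a scoring endpoint instead of a summarization one.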
Related papers
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements [167.73134600289603]
evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models; a brief usage sketch is given after this list.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets.
arXiv Detail & Related papers (2022-09-30T18:35:39Z)
- Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
This dataset has two unique aspects: (i) long, speech-based podcast documents as input; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z)
- NEWTS: A Corpus for News Topic-Focused Summarization [9.872518517174498]
This paper introduces the first topical summarization corpus, based on the well-known CNN/Dailymail dataset.
We evaluate a range of existing techniques and analyze the effectiveness of different prompting methods.
arXiv Detail & Related papers (2022-05-31T10:01:38Z)
- Summary Explorer: Visualizing the State of the Art in Text Summarization [23.45323725326221]
This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems.
The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias) encapsulated in a guided assessment based on tailored visualizations.
arXiv Detail & Related papers (2021-08-04T07:11:19Z)
- SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization [14.787106201073154]
SummVis is an open-source tool for visualizing abstractive summaries.
It enables fine-grained analysis of the models, data, and evaluation metrics associated with text summarization.
arXiv Detail & Related papers (2021-04-15T17:13:00Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- Few-Shot Learning for Opinion Summarization [117.70510762845338]
Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents.
In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text.
Our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
arXiv Detail & Related papers (2020-04-30T15:37:38Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof; a sketch of this construction follows the list.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
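The Evaluate & Evaluation on the Hub entry above describes a metrics library. As a brief usage sketch (the choice of ROUGE and the toy strings are ours, not from that paper), computing a summarization metric with the Hugging Face evaluate library looks roughly like this:

```python
# Minimal sketch of scoring system summaries with the `evaluate` library;
# the metric (ROUGE) and the example strings are illustrative choices.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the tool integrates models and measures as plugins"]
references = ["summary workbench integrates models and evaluation measures as docker plugins"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # a dict of ROUGE variants, e.g. rouge1, rouge2, rougeL, rougeLsum
```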
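The noising-and-denoising entry above outlines a synthetic data construction: sample a review, treat it as the pseudo-summary, and pair it with noisy versions. The toy sketch below illustrates that idea under our own assumptions; the word-dropping noise function and sampling scheme are simplifications, not the paper's exact procedure.

```python
# Illustrative sketch of building synthetic (noisy reviews -> pseudo-summary)
# training pairs from a review corpus; the noise model is our own simplification.
import random

def drop_words(text: str, p: float = 0.2) -> str:
    """Randomly drop words to create a noisy variant of the pseudo-summary."""
    kept = [w for w in text.split() if random.random() > p]
    return " ".join(kept) if kept else text

def make_synthetic_pair(reviews, num_noisy=8):
    """Sample one review as the pseudo-summary and pair it with noisy versions."""
    summary = random.choice(reviews)
    noisy_inputs = [drop_words(summary) for _ in range(num_noisy)]
    return noisy_inputs, summary

reviews = [
    "Great battery life and a sharp screen, but the speakers are weak.",
    "The screen is sharp and the battery lasts all day; speakers could be louder.",
]
inputs, target = make_synthetic_pair(reviews)
```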
This list is automatically generated from the titles and abstracts of the papers on this site.