Summary Workbench: Unifying Application and Evaluation of Text
Summarization Models
- URL: http://arxiv.org/abs/2210.09587v1
- Date: Tue, 18 Oct 2022 04:47:25 GMT
- Title: Summary Workbench: Unifying Application and Evaluation of Text
Summarization Models
- Authors: Shahbaz Syed, Dominik Schwabe, Martin Potthast
- Abstract summary: New models and evaluation measures can be easily integrated as Docker-based plugins.
Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses.
- Score: 24.40171915438056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Summary Workbench, a new tool for developing and
evaluating text summarization models. New models and evaluation measures can be
easily integrated as Docker-based plugins, allowing users to examine the quality of
their summaries against any input and to evaluate them using various evaluation
measures. Visual analyses combining multiple measures provide insights into the
models' strengths and weaknesses. The tool is hosted at
https://tldr.demo.webis.de and also supports local deployment for private
resources.
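The abstract describes models and evaluation measures as Docker-based plugins but does not spell out the plugin interface here. As a purely illustrative sketch (the route name, JSON fields, and lead-3 baseline are assumptions, not Summary Workbench's actual API), a summarizer plugin could be a small containerized HTTP service:

```python
# Hypothetical summarizer plugin: a tiny HTTP service that could be packaged
# into a Docker image. The route and JSON fields below are illustrative
# assumptions, not the interface defined by Summary Workbench.
from flask import Flask, jsonify, request

app = Flask(__name__)

def lead_3(text: str) -> str:
    """Trivial baseline: return the first three sentences as the summary."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    return ". ".join(sentences[:3])

@app.route("/summarize", methods=["POST"])
def summarize():
    payload = request.get_json(force=True)
    documents = payload.get("documents", [])
    return jsonify({"summaries": [lead_3(doc) for doc in documents]})

if __name__ == "__main__":
    # Inside a container, this port would be exposed to the host tool.
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile wrapping such a script and its dependencies would make the model pluggable into the evaluation front end; evaluation-measure plugins could follow the same pattern with a scoring endpoint instead of a summarization one.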
Related papers
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements [167.73134600289603]
evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models; a brief usage sketch is given after this list.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets.
arXiv Detail & Related papers (2022-09-30T18:35:39Z)
- Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
This dataset has two unique aspects: (i) long, speech-based podcast documents as input; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z)
- NEWTS: A Corpus for News Topic-Focused Summarization [9.872518517174498]
This paper introduces the first topical summarization corpus, based on the well-known CNN/Dailymail dataset.
We evaluate a range of existing techniques and analyze the effectiveness of different prompting methods.
arXiv Detail & Related papers (2022-05-31T10:01:38Z)
- Summary Explorer: Visualizing the State of the Art in Text Summarization [23.45323725326221]
This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems.
The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias) encapsulated in a guided assessment based on tailored visualizations.
arXiv Detail & Related papers (2021-08-04T07:11:19Z)
- SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization [14.787106201073154]
SummVis is an open-source tool for visualizing abstractive summaries.
It enables fine-grained analysis of the models, data, and evaluation metrics associated with text summarization.
arXiv Detail & Related papers (2021-04-15T17:13:00Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- Few-Shot Learning for Opinion Summarization [117.70510762845338]
Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents.
In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text.
Our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
arXiv Detail & Related papers (2020-04-30T15:37:38Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof; a sketch of this construction follows the list.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
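The Evaluate & Evaluation on the Hub entry above describes a metrics library. As a brief usage sketch (the choice of ROUGE and the toy strings are ours, not from that paper), computing a summarization metric with the Hugging Face evaluate library looks roughly like this:

```python
# Minimal sketch of scoring system summaries with the `evaluate` library;
# the metric (ROUGE) and the example strings are illustrative choices.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the tool integrates models and measures as plugins"]
references = ["summary workbench integrates models and evaluation measures as docker plugins"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # a dict of ROUGE variants, e.g. rouge1, rouge2, rougeL, rougeLsum
```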
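The noising-and-denoising entry above outlines a synthetic data construction: sample a review, treat it as the pseudo-summary, and pair it with noisy versions. The toy sketch below illustrates that idea under our own assumptions; the word-dropping noise function and sampling scheme are simplifications, not the paper's exact procedure.

```python
# Illustrative sketch of building synthetic (noisy reviews -> pseudo-summary)
# training pairs from a review corpus; the noise model is our own simplification.
import random

def drop_words(text: str, p: float = 0.2) -> str:
    """Randomly drop words to create a noisy variant of the pseudo-summary."""
    kept = [w for w in text.split() if random.random() > p]
    return " ".join(kept) if kept else text

def make_synthetic_pair(reviews, num_noisy=8):
    """Sample one review as the pseudo-summary and pair it with noisy versions."""
    summary = random.choice(reviews)
    noisy_inputs = [drop_words(summary) for _ in range(num_noisy)]
    return noisy_inputs, summary

reviews = [
    "Great battery life and a sharp screen, but the speakers are weak.",
    "The screen is sharp and the battery lasts all day; speakers could be louder.",
]
inputs, target = make_synthetic_pair(reviews)
```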
This list is automatically generated from the titles and abstracts of the papers on this site.