SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics
- URL: http://arxiv.org/abs/2007.05374v1
- Date: Fri, 10 Jul 2020 13:26:37 GMT
- Title: SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics
- Authors: Daniel Deutsch, Dan Roth
- Abstract summary: SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
- Score: 74.28810048824519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SacreROUGE, an open-source library for using and developing
summarization evaluation metrics. SacreROUGE removes many obstacles that
researchers face when using or developing metrics: (1) The library provides
Python wrappers around the official implementations of existing evaluation
metrics so they share a common, easy-to-use interface; (2) it provides
functionality to evaluate how well any metric implemented in the library
correlates to human-annotated judgments, so no additional code needs to be
written for a new evaluation metric; and (3) it includes scripts for loading
datasets that contain human judgments so they can easily be used for
evaluation. This work describes the design of the library, including the core
Metric interface, the command-line API for evaluating summarization models and
metrics, and the scripts to load and reformat publicly available datasets. The
development of SacreROUGE is ongoing and open to contributions from the
community.
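To make the design described above concrete, the sketch below shows, in plain Python, the kind of shared metric interface and correlation check against human judgments that the abstract refers to. The class and function names here (SimpleMetric, TokenOverlap, correlate) are illustrative assumptions, not SacreROUGE's actual API; consult the library's documentation for the real Metric interface and its meta-evaluation commands.

```python
# Illustrative sketch only: names below are assumptions, not SacreROUGE's API.
from typing import Dict, List

from scipy.stats import kendalltau, pearsonr, spearmanr


class SimpleMetric:
    """A minimal metric interface: score one summary against its references."""

    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        raise NotImplementedError

    def score_all(self, summaries: List[str],
                  references: List[List[str]]) -> List[Dict[str, float]]:
        # A shared interface means every metric can be scored the same way.
        return [self.score(s, r) for s, r in zip(summaries, references)]


class TokenOverlap(SimpleMetric):
    """Toy stand-in for a real metric: best unigram recall over the references."""

    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        summary_tokens = set(summary.lower().split())
        recalls = []
        for reference in references:
            ref_tokens = set(reference.lower().split())
            recalls.append(len(summary_tokens & ref_tokens) / max(len(ref_tokens), 1))
        return {"token-overlap": max(recalls, default=0.0)}


def correlate(metric_scores: List[float], human_scores: List[float]) -> Dict[str, float]:
    """Correlation coefficients commonly reported when meta-evaluating a metric."""
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores)[0],
        "kendall": kendalltau(metric_scores, human_scores)[0],
    }


if __name__ == "__main__":
    metric = TokenOverlap()
    summaries = ["the cat sat on the mat", "a dog barked loudly", "rain fell all day"]
    references = [["the cat sat on a mat"], ["the dog barked at night"], ["it rained for hours"]]
    human = [0.9, 0.6, 0.3]  # hypothetical human judgments

    scores = [s["token-overlap"] for s in metric.score_all(summaries, references)]
    print(correlate(scores, human))
```

In the real library, correlations are typically reported at more than one granularity (for example per-summary and per-system); the single flat correlation above collapses that distinction purely for illustration.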
Related papers
- BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation allows Large Language Models to be enhanced with external knowledge.
Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline.
In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research that standardizes RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
- Rethinking Evaluation Metrics of Open-Vocabulary Segmentation [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We design novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks.
arXiv Detail & Related papers (2023-11-06T18:59:01Z)
- Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z)
- Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements [167.73134600289603]
evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets (a short usage sketch of the evaluate library appears at the end of this page).
arXiv Detail & Related papers (2022-09-30T18:35:39Z)
- Document Intelligence Metrics for Visually Rich Document Evaluation [0.10499611180329803]
We introduce DI-Metrics, a Python library devoted to visually rich document (VRD) model evaluation.
We apply DI-Metrics to evaluate information extraction performance on the publicly available CORD dataset.
arXiv Detail & Related papers (2022-05-23T11:55:05Z)
- MISeval: a Metric Library for Medical Image Segmentation Evaluation [1.4680035572775534]
There is no universal metric library in Python for standardized and reproducible evaluation.
We propose MISeval, our open-source, publicly available Python package: a metric library for medical image segmentation evaluation.
arXiv Detail & Related papers (2022-01-23T23:06:47Z)
- Scikit-dimension: a Python package for intrinsic dimension estimation [58.8599521537]
This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation.
The scikit-dimension package provides a uniform implementation of most of the known intrinsic dimension (ID) estimators based on the scikit-learn application programming interface.
We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation in real-life and synthetic data.
arXiv Detail & Related papers (2021-09-06T16:46:38Z)
- Captum: A unified and generic model interpretability library for PyTorch [49.72749684393332]
We introduce a novel, unified, open-source model interpretability library for PyTorch.
The library contains generic implementations of a number of gradient and perturbation-based attribution algorithms.
It can be used for both classification and non-classification models.
arXiv Detail & Related papers (2020-09-16T18:57:57Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
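As a companion to the Evaluate & Evaluation on the Hub entry above, the sketch below shows a minimal ROUGE computation with the evaluate library. It assumes the evaluate package and its rouge_score backend are installed; the exact keys and value types of the returned dictionary can differ between library versions.

```python
# Minimal sketch, assuming `pip install evaluate rouge_score` has been run.
import evaluate

predictions = ["the cat sat on the mat", "a dog barked loudly"]
references = ["the cat sat on a mat", "the dog barked at night"]

# Load the ROUGE wrapper and compute aggregated scores for the whole batch.
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)

# In recent versions this is a dict along the lines of
# {"rouge1": ..., "rouge2": ..., "rougeL": ..., "rougeLsum": ...} with float values.
print(results)
```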