SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics
- URL: http://arxiv.org/abs/2007.05374v1
- Date: Fri, 10 Jul 2020 13:26:37 GMT
- Title: SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics
- Authors: Daniel Deutsch, Dan Roth
- Abstract summary: SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
- Score: 74.28810048824519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SacreROUGE, an open-source library for using and developing
summarization evaluation metrics. SacreROUGE removes many obstacles that
researchers face when using or developing metrics: (1) The library provides
Python wrappers around the official implementations of existing evaluation
metrics so they share a common, easy-to-use interface; (2) it provides
functionality to evaluate how well any metric implemented in the library
correlates to human-annotated judgments, so no additional code needs to be
written for a new evaluation metric; and (3) it includes scripts for loading
datasets that contain human judgments so they can easily be used for
evaluation. This work describes the design of the library, including the core
Metric interface, the command-line API for evaluating summarization models and
metrics, and the scripts to load and reformat publicly available datasets. The
development of SacreROUGE is ongoing and open to contributions from the
community.
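To make the design described above concrete, the sketch below shows, in plain Python, the kind of shared metric interface and correlation check against human judgments that the abstract refers to. The class and function names here (SimpleMetric, TokenOverlap, correlate) are illustrative assumptions, not SacreROUGE's actual API; consult the library's documentation for the real Metric interface and its meta-evaluation commands.

```python
# Illustrative sketch only: names below are assumptions, not SacreROUGE's API.
from typing import Dict, List

from scipy.stats import kendalltau, pearsonr, spearmanr


class SimpleMetric:
    """A minimal metric interface: score one summary against its references."""

    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        raise NotImplementedError

    def score_all(self, summaries: List[str],
                  references: List[List[str]]) -> List[Dict[str, float]]:
        # A shared interface means every metric can be scored the same way.
        return [self.score(s, r) for s, r in zip(summaries, references)]


class TokenOverlap(SimpleMetric):
    """Toy stand-in for a real metric: best unigram recall over the references."""

    def score(self, summary: str, references: List[str]) -> Dict[str, float]:
        summary_tokens = set(summary.lower().split())
        recalls = []
        for reference in references:
            ref_tokens = set(reference.lower().split())
            recalls.append(len(summary_tokens & ref_tokens) / max(len(ref_tokens), 1))
        return {"token-overlap": max(recalls, default=0.0)}


def correlate(metric_scores: List[float], human_scores: List[float]) -> Dict[str, float]:
    """Correlation coefficients commonly reported when meta-evaluating a metric."""
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores)[0],
        "kendall": kendalltau(metric_scores, human_scores)[0],
    }


if __name__ == "__main__":
    metric = TokenOverlap()
    summaries = ["the cat sat on the mat", "a dog barked loudly", "rain fell all day"]
    references = [["the cat sat on a mat"], ["the dog barked at night"], ["it rained for hours"]]
    human = [0.9, 0.6, 0.3]  # hypothetical human judgments

    scores = [s["token-overlap"] for s in metric.score_all(summaries, references)]
    print(correlate(scores, human))
```

In the real library, correlations are typically reported at more than one granularity (for example per-summary and per-system); the single flat correlation above collapses that distinction purely for illustration.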
Related papers
- BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation allows Large Language Models to be enhanced with external knowledge.
Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline.
In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research that standardizes RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
- Rethinking Evaluation Metrics of Open-Vocabulary Segmentation [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We design novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks.
arXiv Detail & Related papers (2023-11-06T18:59:01Z)
- Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z)
- Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements [167.73134600289603]
evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets (a short usage sketch of the evaluate library appears at the end of this page).
arXiv Detail & Related papers (2022-09-30T18:35:39Z)
- Document Intelligence Metrics for Visually Rich Document Evaluation [0.10499611180329803]
We introduce DI-Metrics, a Python library devoted to visually rich document (VRD) model evaluation.
We apply DI-Metrics to evaluate information extraction performance on the publicly available CORD dataset.
arXiv Detail & Related papers (2022-05-23T11:55:05Z)
- MISeval: a Metric Library for Medical Image Segmentation Evaluation [1.4680035572775534]
There is no universal metric library in Python for standardized and reproducible evaluation.
We propose MISeval, our open-source, publicly available Python package: a metric library for medical image segmentation evaluation.
arXiv Detail & Related papers (2022-01-23T23:06:47Z)
- Scikit-dimension: a Python package for intrinsic dimension estimation [58.8599521537]
This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation.
The scikit-dimension package provides a uniform implementation of most of the known intrinsic dimension (ID) estimators based on the scikit-learn application programming interface.
We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation in real-life and synthetic data.
arXiv Detail & Related papers (2021-09-06T16:46:38Z)
- Captum: A unified and generic model interpretability library for PyTorch [49.72749684393332]
We introduce a novel, unified, open-source model interpretability library for PyTorch.
The library contains generic implementations of a number of gradient and perturbation-based attribution algorithms.
It can be used for both classification and non-classification models.
arXiv Detail & Related papers (2020-09-16T18:57:57Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
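As a companion to the Evaluate & Evaluation on the Hub entry above, the sketch below shows a minimal ROUGE computation with the evaluate library. It assumes the evaluate package and its rouge_score backend are installed; the exact keys and value types of the returned dictionary can differ between library versions.

```python
# Minimal sketch, assuming `pip install evaluate rouge_score` has been run.
import evaluate

predictions = ["the cat sat on the mat", "a dog barked loudly"]
references = ["the cat sat on a mat", "the dog barked at night"]

# Load the ROUGE wrapper and compute aggregated scores for the whole batch.
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)

# In recent versions this is a dict along the lines of
# {"rouge1": ..., "rouge2": ..., "rougeL": ..., "rougeLsum": ...} with float values.
print(results)
```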