Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
- URL: http://arxiv.org/abs/2503.17039v1
- Date: Fri, 21 Mar 2025 10:52:20 GMT
- Title: Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
- Authors: Jeremy Barnes, Naiara Perez, Alba Bonet-Jover, Begoña Altuna,
- Abstract summary: We collect human judgments on 2,040 abstractive summaries in Basque and Spanish.<n>For each summary, annotators evaluated five criteria on a 5-point Likert scale: coherence, consistency, fluency, relevance, and 5W1H.<n>We release BASSE and our code publicly, along with the first large-scale Basque summarization dataset containing 22,525 news articles with their subheads.
- Score: 6.974741712647656
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Studies on evaluation metrics and LLM-as-a-Judge models for automatic text summarization have largely been focused on English, limiting our understanding of their effectiveness in other languages. Through our new dataset BASSE (BAsque and Spanish Summarization Evaluation), we address this situation by collecting human judgments on 2,040 abstractive summaries in Basque and Spanish, generated either manually or by five LLMs with four different prompts. For each summary, annotators evaluated five criteria on a 5-point Likert scale: coherence, consistency, fluency, relevance, and 5W1H. We use these data to reevaluate traditional automatic metrics used for evaluating summaries, as well as several LLM-as-a-Judge models that show strong performance on this task in English. Our results show that currently proprietary judge LLMs have the highest correlation with human judgments, followed by criteria-specific automatic metrics, while open-sourced judge LLMs perform poorly. We release BASSE and our code publicly, along with the first large-scale Basque summarization dataset containing 22,525 news articles with their subheads.
Related papers
- JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs [3.83467384247581]
JobResQA is a benchmark for evaluating Machine Reading (MRC) capabilities on HR-specific tasks.<n>The dataset comprises 581 QA pairs across 105 résumé-job description pairs in five languages.
arXiv Detail & Related papers (2026-01-30T17:06:59Z) - An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs)
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - The Eval4NLP 2023 Shared Task on Prompting Large Language Models as
Explainable Metrics [36.52897053496835]
generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples.
We introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation.
We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset.
arXiv Detail & Related papers (2023-10-30T17:55:08Z) - BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types.
We find that closed-source LLMs such as GPT-4 and 2 produce summaries with higher BooookScore than those generated by open-source models.
arXiv Detail & Related papers (2023-10-01T20:46:44Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs)
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.