UMSE: Unified Multi-scenario Summarization Evaluation
- URL: http://arxiv.org/abs/2305.16895v1
- Date: Fri, 26 May 2023 12:54:44 GMT
- Title: UMSE: Unified Multi-scenario Summarization Evaluation
- Authors: Shen Gao, Zhitao Yao, Chongyang Tao, Xiuying Chen, Pengjie Ren,
Zhaochun Ren and Zhumin Chen
- Abstract summary: Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
Our UMSE is the first unified summarization evaluation framework that can be applied in three evaluation scenarios.
- Score: 52.60867881867428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Summarization quality evaluation is a non-trivial task in text summarization.
Contemporary methods fall mainly into two scenarios: (1) reference-based:
evaluating against a human-labeled reference summary; (2) reference-free:
evaluating the consistency between the summary and its source document. Recent
studies mostly focus on one of these scenarios and explore training neural
models built on PLMs to align with human criteria. However, the models for
different scenarios are optimized individually, which can lead to suboptimal
performance because they neglect the knowledge shared across scenarios.
Moreover, designing a separate model for each scenario is inconvenient for
users. Motivated by this, we propose the Unified Multi-scenario Summarization
Evaluation Model (UMSE). Specifically, we propose a perturbed prefix tuning
method to share cross-scenario knowledge between scenarios and use a
self-supervised training paradigm to optimize the model without extra human
labeling. UMSE is the first unified summarization evaluation framework that
can be applied in three evaluation scenarios. Experimental results across
three typical scenarios on the SummEval benchmark show that UMSE achieves
performance comparable to several strong existing methods that were
specifically designed for each scenario.
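The abstract names the key mechanism (scenario-specific prefixes perturbed during training) without giving detail. Below is a minimal, illustrative Python sketch of that idea; the class name, the Gaussian perturbation scheme, the scenario keys, and the scoring head are all assumptions for illustration, not the authors' released implementation.
```python
import torch
import torch.nn as nn


class ScenarioPrefixEvaluator(nn.Module):
    """Shared encoder plus per-scenario soft prefixes.

    During training each prefix is perturbed with Gaussian noise so the
    encoder sees overlapping prefix regions, loosely encouraging knowledge
    sharing across scenarios. This is an assumption about what "perturbed
    prefix tuning" means here, not the paper's exact recipe.
    """

    def __init__(self, encoder, hidden, scenarios, prefix_len=8, noise_std=0.1):
        super().__init__()
        self.encoder = encoder  # stand-in for a (frozen) PLM body
        self.prefix = nn.ParameterDict({
            s: nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)
            for s in scenarios
        })
        self.noise_std = noise_std
        self.score_head = nn.Linear(hidden, 1)  # scalar quality score

    def forward(self, token_emb, scenario):
        # token_emb: (batch, seq_len, hidden) embeddings of the summary
        # paired with a reference and/or document, depending on scenario.
        prefix = self.prefix[scenario]
        if self.training:
            prefix = prefix + self.noise_std * torch.randn_like(prefix)
        prefix = prefix.unsqueeze(0).expand(token_emb.size(0), -1, -1)
        hidden_states = self.encoder(torch.cat([prefix, token_emb], dim=1))
        return self.score_head(hidden_states.mean(dim=1)).squeeze(-1)


# Toy usage with an MLP standing in for the PLM encoder.
hidden = 32
encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                        nn.Linear(hidden, hidden))
model = ScenarioPrefixEvaluator(encoder, hidden,
                                scenarios=["ref_based", "ref_free", "cross"])
scores = model(torch.randn(4, 20, hidden), scenario="ref_based")
print(scores.shape)  # torch.Size([4])
```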
Related papers
- FEET: A Framework for Evaluating Embedding Techniques [0.5837446811360741]
FEET is a standardized protocol designed to guide the development and benchmarking of foundation models.
We define three primary use cases: frozen embeddings, few-shot embeddings, and fully fine-tuned embeddings.
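As a rough illustration of the three use cases FEET distinguishes, the sketch below contrasts a frozen encoder, a few-shot head, and full fine-tuning on a toy task; the encoder, data, and training loop are placeholders, not FEET's actual protocol or datasets.
```python
import torch
import torch.nn as nn


def make_model(freeze_encoder):
    encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
    if freeze_encoder:
        for p in encoder.parameters():
            p.requires_grad_(False)  # "frozen embeddings" regime
    return nn.Sequential(encoder, nn.Linear(32, 2))  # task head on top


def train(model, x, y, steps=50):
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad])
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()


x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
frozen = train(make_model(freeze_encoder=True), x, y)            # frozen embeddings
few_shot = train(make_model(freeze_encoder=True), x[:8], y[:8])  # few-shot: k=8 examples
finetuned = train(make_model(freeze_encoder=False), x, y)        # fully fine-tuned
print(frozen, few_shot, finetuned)
```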
arXiv Detail & Related papers (2024-11-02T18:03:49Z)
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
It is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original formulation.
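To make the "cloze" versus multiple-choice distinction concrete, here is a hedged sketch of the two formulations; `loglik` is a stand-in for a real language-model log-likelihood call (e.g., summing token log-probs of the continuation), and the prompt format is an assumption rather than the OLMES specification.
```python
def loglik(prompt, continuation):
    # Placeholder: a real implementation would query a language model.
    return -len(continuation)  # dummy heuristic, only so the example runs


def answer_cloze(question, options):
    # Cloze: score each answer string as a continuation of the question.
    return max(options, key=lambda o: loglik(question + " ", o))


def answer_mcf(question, options):
    # Multiple-choice formulation: present lettered options and score
    # only the letter tokens as continuations.
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{l}. {o}" for l, o in zip(letters, options)) + "\nAnswer:"
    best = max(letters[:len(options)], key=lambda l: loglik(prompt, " " + l))
    return options[letters.index(best)]


opts = ["Paris", "Berlin", "Rome", "Madrid"]
print(answer_cloze("What is the capital of France?", opts))
print(answer_mcf("What is the capital of France?", opts))
```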
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel Selection [27.531083525683243]
Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data.
We propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality.
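A generic sketch of score-based pseudolabel selection follows; the `quality_score` heuristic is a naive stand-in, not the SiCF score (the paper's three quality dimensions are not reproduced here).
```python
def quality_score(dialogue, summary):
    # Placeholder: fraction of summary words that appear in the dialogue.
    d = set(dialogue.lower().split())
    s = summary.lower().split()
    return sum(w in d for w in s) / max(len(s), 1)


def select_pseudolabels(pairs, keep_ratio=0.5):
    # Keep only the highest-scoring (dialogue, model summary) pairs
    # for further training of the summarizer.
    ranked = sorted(pairs, key=lambda p: quality_score(*p), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]


pairs = [("A: hi B: let's meet at noon", "they meet at noon"),
         ("A: the report is due friday", "the cat sat on the mat")]
print(select_pseudolabels(pairs))
```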
arXiv Detail & Related papers (2024-03-06T22:06:23Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
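A minimal sketch of prompting an LLM to rate a summary along several dimensions is shown below; `call_llm` is a placeholder for a real chat API, and the reviewer role and dimensions are illustrative, not the paper's exact setup.
```python
DIMENSIONS = ["informativeness", "succinctness", "grammar"]


def call_llm(prompt):
    return "3"  # stub; replace with a real LLM client


def rate_summary(document, summary):
    scores = {}
    for dim in DIMENSIONS:
        prompt = (f"You are a meticulous reviewer judging {dim}.\n"
                  f"Document: {document}\nSummary: {summary}\n"
                  f"Rate the summary's {dim} from 1 to 5. Reply with a number.")
        scores[dim] = int(call_llm(prompt).strip())
    return scores


print(rate_summary("The council approved the budget.", "Budget approved."))
```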
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization [54.59104881168188]
UniSumm is a unified few-shot summarization model pre-trained with multiple summarization tasks.
SummZoo is a new benchmark to better evaluate few-shot summarizers.
arXiv Detail & Related papers (2022-11-17T18:54:47Z)
- An Information-Theoretic Approach for Estimating Scenario Generalization in Crowd Motion Prediction [27.10815774845461]
We propose a novel scoring method that characterizes how models trained on source crowd scenarios generalize to target crowd scenarios.
The Interaction component characterizes the difficulty of a scenario domain, while the Diversity score captures its diversity.
Our experimental results validate the efficacy of the proposed method on several simulated and real-world (source,target) generalization tasks.
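For intuition, here is a hedged sketch of an entropy-style diversity measure over scenario features; this generic Shannon-entropy estimate is an assumption for illustration, not the paper's Interaction/Diversity estimators.
```python
import math
from collections import Counter


def diversity_score(scenario_features):
    # Shannon entropy (in bits) of the empirical feature distribution:
    # uniform, varied scenarios score high; homogeneous ones score low.
    counts = Counter(scenario_features)
    n = len(scenario_features)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(diversity_score(["sparse", "dense", "dense", "bottleneck"]))  # 1.5 bits
```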
arXiv Detail & Related papers (2022-11-02T01:39:30Z)
- Scenario-Adaptive and Self-Supervised Model for Multi-Scenario Personalized Recommendation [35.4495536683099]
We propose a Scenario-Adaptive and Self-Supervised (SASS) model to address three key challenges in multi-scenario recommendation.
The model is built symmetrically on both the user side and the item side, so that we can obtain distinguishable representations of items across scenarios.
This model also achieves more than 8.0% improvement on Average Watching Time Per User in online A/B tests.
arXiv Detail & Related papers (2022-08-24T11:44:00Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.