Related papers: Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization

Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization

URL: http://arxiv.org/abs/2005.03510v2
Date: Mon, 2 Nov 2020 02:40:58 GMT
Title: Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization
Authors: Dongyub Lee, Myeongcheol Shin, Taesun Whang, Seungwoo Cho, Byeongil Ko, Daniel Lee, Eunggyun Kim, Jaechoon Jo
Abstract summary: We propose evaluation metrics that reflect semantic meanings of a reference summary and the original document. We then propose a method for improving the correlation of the metrics with human judgment.
Score: 6.826626737986031
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text summarization refers to the process that generates a shorter form of text from the source document preserving salient information. Many existing works for text summarization are generally evaluated by using recall-oriented understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are computed based on n-gram overlap, they do not reflect semantic meaning correspondences between generated and reference summaries. Because Korean is an agglutinative language that combines various morphemes into a word that express several meanings, ROUGE is not suitable for Korean summarization. In this paper, we propose evaluation metrics that reflect semantic meanings of a reference summary and the original document, Reference and Document Aware Semantic Score (RDASS). We then propose a method for improving the correlation of the metrics with human judgment. Evaluation results show that the correlation with human judgment is significantly higher for our evaluation metrics than for ROUGE scores.

Related papers

References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation [25.428322811598722]
This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets.<n>We demonstrate that many popular metrics exhibit significant instability.<n>This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets.
arXiv Detail & Related papers (2025-06-17T09:17:41Z)
Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
Hierarchical Indexing for Retrieval-Augmented Opinion Summarization [60.5923941324953]
We propose a method for unsupervised abstractive opinion summarization that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs) Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy. At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews.
arXiv Detail & Related papers (2024-03-01T10:38:07Z)
Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. In this paper, we study if there are any deficiencies in reference-free metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references. We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR) Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously. Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z)
SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations. We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
Spurious Correlations in Reference-Free Evaluation of Text Generation [35.80256755393739]
We show that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
arXiv Detail & Related papers (2022-04-21T05:32:38Z)
A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study to a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.