Extending Text Informativeness Measures to Passage Interestingness
Evaluation (Language Model vs. Word Embedding)
- URL: http://arxiv.org/abs/2004.06747v1
- Date: Tue, 14 Apr 2020 18:22:48 GMT
- Title: Extending Text Informativeness Measures to Passage Interestingness
Evaluation (Language Model vs. Word Embedding)
- Authors: Carlos-Emiliano González-Gallardo, Eric SanJuan, Juan-Manuel Torres-Moreno
- Abstract summary: This paper defines the concept of Interestingness as a generalization of Informativeness.
We then study the ability of state-of-the-art Informativeness measures to cope with this generalization.
We show that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.
- Score: 1.2998637003026272
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Standard informativeness measures used to evaluate Automatic Text
Summarization mostly rely on n-gram overlap between the automatic summary and
the reference summaries. These measures differ in the metric they use (cosine,
ROUGE, Kullback-Leibler, Logarithm Similarity, etc.) and the bag of terms they
consider (single words, word n-grams, entities, nuggets, etc.). Recent word
embedding approaches offer a continuous alternative to discrete approaches
based on the presence or absence of a text unit. Informativeness measures have
been extended to Focused Information Retrieval evaluation, where a user's
information need is represented by short queries; in particular, for the
CLEF-INEX Tweet Contextualization task, tweet contents have been used as
queries. In this paper we define the concept of Interestingness as a
generalization of Informativeness, whereby the information need is diverse and
formalized as an unknown set of implicit queries. We then study the ability of
state-of-the-art Informativeness measures to cope with this generalization. We
show that within this new framework, standard word embeddings outperform
discrete measures only on unigrams, whereas bigrams appear to be a key element
of interestingness evaluation. Lastly, we show that the CLEF-INEX Tweet
Contextualization 2012 Logarithm Similarity measure provides the best results.
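
The abstract opposes two families of measures: discrete ones based on the presence or absence of n-grams, and continuous ones based on word embeddings. The sketch below illustrates that contrast; it is not the authors' implementation. Kullback-Leibler divergence stands in for the discrete family (the exact INEX Logarithm Similarity formula is not reproduced here), tokenization is plain whitespace splitting, and `toy_vectors` is a deterministic placeholder for real pre-trained embeddings. The `n` parameter makes the unigram-versus-bigram distinction discussed in the abstract explicit.

```python
from collections import Counter
from math import log, sqrt


def ngrams(tokens, n):
    """Word n-grams of a token sequence (n=1 gives unigrams, n=2 bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kl_divergence(reference, summary, n=1, smoothing=1e-6):
    """Discrete view: smoothed Kullback-Leibler divergence between the
    reference and summary n-gram distributions (lower means the summary
    covers the reference better)."""
    ref, summ = Counter(ngrams(reference, n)), Counter(ngrams(summary, n))
    vocab = set(ref) | set(summ)
    ref_total = sum(ref.values()) + smoothing * len(vocab)
    summ_total = sum(summ.values()) + smoothing * len(vocab)
    div = 0.0
    for g in vocab:
        p = (ref[g] + smoothing) / ref_total    # reference probability
        q = (summ[g] + smoothing) / summ_total  # summary probability
        div += p * log(p / q)
    return div


def embedding_cosine(reference, summary, vectors):
    """Continuous view: cosine similarity between averaged word vectors.
    Assumes every token has a vector; real systems skip OOV tokens."""
    def centroid(tokens):
        vecs = [vectors[t] for t in tokens if t in vectors]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    u, v = centroid(reference), centroid(summary)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Toy data; toy_vectors is a deterministic stand-in for real embeddings.
ref = "the summary should convey the main facts of the document".split()
summ = "the summary conveys the main facts".split()
toy_vectors = {w: [(ord(c) % 7) + 1.0 for c in (w * 4)[:4]]
               for w in set(ref + summ)}

print("KL divergence (unigrams):", round(kl_divergence(ref, summ, n=1), 3))
print("KL divergence (bigrams): ", round(kl_divergence(ref, summ, n=2), 3))
print("embedding cosine:", round(embedding_cosine(ref, summ, toy_vectors), 3))
```

Note that the additive smoothing is what keeps the divergence finite when a summary n-gram never occurs in the reference; the choice of smoothing constant is one of the knobs that distinguishes variants of these discrete measures.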
Related papers
- Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation [21.650619533772232]
This work investigates whether and to what degree superficial attributes of summary texts suffice to predict "factuality".
We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements.
Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries.
arXiv Detail & Related papers (2024-11-25T18:15:15Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging vision-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- Looking at words and points with attention: a benchmark for text-to-shape coherence [17.340484439401894]
The evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark.
We employ large language models to automatically refine descriptions associated with shapes.
To validate our approach, we conduct a user study and quantitatively compare our metric with existing ones.
The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark.
arXiv Detail & Related papers (2023-09-14T17:59:48Z)
- Entity Disambiguation with Entity Definitions [50.01142092276296]
Local models have recently attained astounding performance in Entity Disambiguation (ED).
Previous works limited their studies to using only the Wikipedia title of each candidate as its textual representation.
In this paper, we address this limitation and investigate to what extent more expressive textual representations can mitigate it.
We report a new state of the art on 2 out of 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns.
arXiv Detail & Related papers (2022-10-11T17:46:28Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Consistency and Coherence from Points of Contextual Similarity [0.0]
The ESTIME measure, recently proposed specifically for factual consistency, achieves high correlations with human expert scores.
This is not a problem for current styles of summarization, but it may become an obstacle for future summarization systems.
arXiv Detail & Related papers (2021-12-22T03:04:20Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn ⟨sentiment, aspect⟩ joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.