Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance
- URL: http://arxiv.org/abs/2010.06150v1
- Date: Tue, 13 Oct 2020 03:46:25 GMT
- Title: Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance
- Authors: Xi Chen, Nan Ding, Tomer Levinboim, Radu Soricut
- Abstract summary: We present two techniques for improving encoding representations for similarity metrics.
We report results over various BERT-backbone learned metrics, achieving state-of-the-art correlation with human ratings on several benchmarks.
- Score: 24.49032191669509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in automatic evaluation metrics for text have shown that deep
contextualized word representations, such as those generated by BERT encoders,
are helpful for designing metrics that correlate well with human judgements. At
the same time, it has been argued that contextualized word representations
exhibit sub-optimal statistical properties for encoding the true similarity
between words or sentences. In this paper, we present two techniques for
improving encoding representations for similarity metrics: a batch-mean
centering strategy that improves statistical properties; and a computationally
efficient tempered Word Mover Distance, for better fusion of the information in
the contextualized word representations. We conduct numerical experiments that
demonstrate the robustness of our techniques, reporting results over various
BERT-backbone learned metrics and achieving state of the art correlation with
human ratings on several benchmarks.
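The two techniques can be illustrated compactly. Below is a minimal sketch, not the authors' implementation: it assumes cosine cost between token embeddings, uniform token weights, and reads the "tempered" relaxation as an entropy-regularized (Sinkhorn-style) transport with temperature tau; the function names are hypothetical.

import numpy as np

def batch_mean_center(embeddings):
    # embeddings: list of [n_i, d] arrays, one per sentence in the batch.
    # Subtract the mean vector over all tokens in the batch, removing the
    # common direction that distorts similarity comparisons.
    mu = np.concatenate(embeddings, axis=0).mean(axis=0, keepdims=True)
    return [e - mu for e in embeddings]

def tempered_wmd(x, y, tau=0.1, n_iter=100):
    # x: [n, d], y: [m, d] centered token embeddings of two sentences.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T                     # cosine distance matrix
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform source weights
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform target weights
    K = np.exp(-cost / tau)                    # temperature-smoothed kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                    # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]         # soft word alignment
    return float((plan * cost).sum())

Under this reading, as tau approaches zero the soft alignment approaches the exact Word Mover Distance, while larger tau fuses information across many word pairs and converges in fewer iterations.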
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document (a minimal sketch follows this entry).
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
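A minimal sketch of the sentence-level comparison described in the SBERTScore entry above; the greedy best-match over source sentences, cosine scoring, mean aggregation, and the model choice are all assumptions for illustration.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def sbertscore_sketch(summary_sents, source_sents):
    # Embed each sentence; unit-normalize so dot products are cosines.
    e_sum = model.encode(summary_sents, normalize_embeddings=True)
    e_src = model.encode(source_sents, normalize_embeddings=True)
    sims = e_sum @ e_src.T
    # Score each summary sentence by its best-supported source sentence,
    # then average: low scores flag possibly unsupported content.
    return float(sims.max(axis=1).mean())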
- Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence [39.065349875944634]
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z)
- Constructing Vec-tionaries to Extract Message Features from Texts: A Case Study of Moral Appeals [5.336592570916432]
We present an approach to construct vec-tionary measurement tools that boost validated dictionaries with word embeddings.
A vec-tionary can produce additional metrics to capture the ambivalence of a message feature beyond its strength in texts (a toy sketch follows this entry).
arXiv Detail & Related papers (2023-12-10T20:37:29Z)
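A toy sketch of the general idea of boosting a validated dictionary with word embeddings; treating strength as the mean similarity to the dictionary's seed words and ambivalence as the spread of those similarities is an assumed operationalization, not the paper's definition.

import numpy as np

def vectionary_scores(doc_vecs, seed_vecs):
    # doc_vecs:  [n, d] embeddings of a document's words
    # seed_vecs: [k, d] embeddings of validated dictionary entries
    axis = seed_vecs.mean(axis=0)
    axis /= np.linalg.norm(axis)                    # dictionary direction
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = docs @ axis                              # cosine to the dictionary axis
    strength = float(sims.mean())     # how strongly the feature is present
    ambivalence = float(sims.std())   # mixed signals around that feature
    return strength, ambivalence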
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that the distribution induced by this optimization is guaranteed to improve perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Quantitative Discourse Cohesion Analysis of Scientific Scholarly Texts using Multilayer Networks [10.556468838821338]
We aim to computationally analyze the discourse cohesion in scientific scholarly texts using a multilayer network representation.
We design section-level and document-level metrics to assess the extent of lexical cohesion in text.
We present an analytical framework, CHIAA (CHeck It Again, Author), to provide pointers to the author for potential improvements in the manuscript.
arXiv Detail & Related papers (2022-05-16T09:10:41Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address this issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions over their positions (a rough sketch follows this entry).
Experiments on Semantic Textual Similarity show the resulting Neighboring Distribution Divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
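A rough sketch of the mask-and-predict strategy above; using difflib for the word-level longest common subsequence, masking one shared word at a time, and comparing the predicted distributions with a symmetric KL divergence are all assumptions for illustration.

import difflib
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_log_probs(words, positions):
    # Mask each shared-word position in turn and record the MLM's
    # predicted log-distribution over the vocabulary at that slot.
    dists = []
    for i in positions:
        masked = list(words)
        masked[i] = tok.mask_token
        enc = tok(" ".join(masked), return_tensors="pt")
        slot = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
        with torch.no_grad():
            logits = mlm(**enc).logits[0, slot]
        dists.append(torch.log_softmax(logits, dim=-1))
    return dists

def ndd_sketch(text_a, text_b):
    a, b = text_a.split(), text_b.split()
    # Shared words via the longest matching blocks of the two word sequences.
    pos_a, pos_b = [], []
    for blk in difflib.SequenceMatcher(a=a, b=b).get_matching_blocks():
        for k in range(blk.size):
            pos_a.append(blk.a + k)
            pos_b.append(blk.b + k)
    da, db = masked_log_probs(a, pos_a), masked_log_probs(b, pos_b)
    # Symmetric KL between the predicted distributions at each shared slot.
    kls = [0.5 * (torch.sum(pa.exp() * (pa - pb)) + torch.sum(pb.exp() * (pb - pa)))
           for pa, pb in zip(da, db)]
    return float(torch.stack(kls).mean()) if kls else 0.0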
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions [27.41088864449921]
Image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of generated captions.
We present the first learned generation metric for evaluating output captions.
We demonstrate a higher correlation of our proposed metric with human judgments on the outputs of a number of state-of-the-art caption models, compared to several other metrics such as BLEURT and BERTScore.
arXiv Detail & Related papers (2021-09-11T13:43:36Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval (sketched after this entry) and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
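A minimal sketch of one way to operationalize contextual word retrieval: embed every word occurrence in context with a multilingual encoder and retrieve the nearest occurrence in the other language by cosine similarity; the model choice and the mean-pooling over subwords are assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def word_vectors(sentence):
    # One contextual vector per word: mean of its subword states, unit-normalized.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]
    word_ids = inputs.word_ids(0)  # token index -> word index (None for specials)
    n_words = max(w for w in word_ids if w is not None) + 1
    vecs = []
    for w in range(n_words):
        idx = [t for t, wid in enumerate(word_ids) if wid == w]
        v = hidden[idx].mean(dim=0)
        vecs.append(v / v.norm())
    return torch.stack(vecs)

def retrieve(src_sentence, tgt_sentence, src_word_index):
    # Return the index of the target-language word whose contextual
    # embedding is closest (by cosine) to the chosen source word.
    sv = word_vectors(src_sentence)[src_word_index]
    tv = word_vectors(tgt_sentence)
    return int(torch.argmax(tv @ sv))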
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.