Layer or Representation Space: What makes BERT-based Evaluation Metrics
Robust?
- URL: http://arxiv.org/abs/2209.02317v2
- Date: Wed, 7 Sep 2022 08:08:28 GMT
- Title: Layer or Representation Space: What makes BERT-based Evaluation Metrics
Robust?
- Authors: Doan Nam Long Vu, Nafise Sadat Moosavi, Steffen Eger
- Abstract summary: In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation.
We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases.
- Score: 29.859455320349866
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The evaluation of recent embedding-based evaluation metrics for text
generation is primarily based on measuring their correlation with human
evaluations on standard benchmarks. However, these benchmarks are mostly from
similar domains to those used for pretraining word embeddings. This raises
concerns about the (lack of) generalization of embedding-based metrics to new
and noisy domains that contain a different vocabulary than the pretraining
data. In this paper, we examine the robustness of BERTScore, one of the most
popular embedding-based metrics for text generation. We show that (a) an
embedding-based metric that has the highest correlation with human evaluations
on a standard benchmark can have the lowest correlation if the amount of input
noise or unknown tokens increases, (b) taking embeddings from the first layer
of pretrained models improves the robustness of all metrics, and (c) the
highest robustness is achieved when using character-level embeddings, instead
of token-based embeddings, from the first layer of the pretrained model.
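As a rough illustration of finding (b), the sketch below computes a BERTScore-style greedy-matching F1 from the hidden states of the first Transformer layer instead of a deep layer. It is a minimal sketch, not the authors' code: the model name, the layer index, and the omission of IDF weighting and baseline rescaling are simplifying assumptions, and a character-level encoder (finding (c)) could be substituted for the model name.

```python
# Minimal BERTScore-style sketch using an early layer (illustrative, not the paper's code).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption; a character-level encoder could be swapped in
LAYER = 1                         # hidden_states[1] = output of the first Transformer layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def embed(sentence: str, layer: int = LAYER) -> torch.Tensor:
    """Return L2-normalized token embeddings from the chosen layer."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer].squeeze(0)  # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate: str, reference: str) -> float:
    """Greedy-matching F1 over pairwise cosine similarities (no IDF weighting)."""
    c, r = embed(candidate), embed(reference)
    sim = c @ r.T                              # cosine similarities (rows are unit vectors)
    precision = sim.max(dim=1).values.mean()   # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()      # best candidate match per reference token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```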
Related papers
- Improving LLM Leaderboards with Psychometrical Methodology [0.0]
The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance.
These benchmarks resemble human tests and surveys, as they consist of questions designed to measure emergent properties in the cognitive behavior of these systems.
However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined.
arXiv Detail & Related papers (2025-01-27T21:21:46Z) - Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations; a minimal multi-reference scoring sketch with sacrebleu appears after this list.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Exploring Category Structure with Contextual Language Models and Lexical
Semantic Networks [0.0]
We test a wider array of methods for probing CLMs to predict typicality scores.
Our experiments, using BERT, show the importance of using the right type of CLM probes.
Results highlight the importance of polysemy in this task.
arXiv Detail & Related papers (2023-02-14T09:57:23Z) - T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, with no significant update over the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z) - Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words"; a minimal sketch of this scoring scheme also appears after this list.
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
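For the "Towards Multiple References Era" entry above, the following is a minimal sketch of scoring a system output against several references with sacrebleu. The sentences are invented examples, and the setup only shows the mechanics of multi-reference BLEU/chrF, not the paper's experiments.

```python
# Hedged sketch: multi-reference BLEU and chrF with sacrebleu (toy sentences, not paper data).
import sacrebleu

hypotheses = ["the cat sat on the mat"]

# One inner list per reference set; each set is aligned with the hypotheses.
references = [
    ["a cat was sitting on the mat"],
    ["the cat is on the mat"],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```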
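For the "Document Ranking with a Pretrained Sequence-to-Sequence Model" entry, the sketch below illustrates the "relevance label as target word" idea: a query-document pair is scored by the probability the model assigns to the word "true" versus "false" as the first decoded token. The checkpoint, prompt template, and label words are assumptions for illustration; an off-the-shelf t5-small has not been fine-tuned for relevance, so the printed number only demonstrates the scoring mechanism.

```python
# Hedged sketch of scoring relevance via a target word (illustrative prompt and checkpoint).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")            # illustrative checkpoint only
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

def relevance_score(query: str, document: str) -> float:
    """Probability mass on 'true' vs 'false' for the first decoded token."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt")
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()

print(relevance_score("capital of France", "Paris is the capital of France."))
```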
This list is automatically generated from the titles and abstracts of the papers on this site.