Just Rank: Rethinking Evaluation with Word and Sentence Similarities
- URL: http://arxiv.org/abs/2203.02679v1
- Date: Sat, 5 Mar 2022 08:40:05 GMT
- Title: Just Rank: Rethinking Evaluation with Word and Sentence Similarities
- Authors: Bin Wang, C.-C. Jay Kuo, Haizhou Li
- Abstract summary: intrinsic evaluation for embeddings lags far behind, with no significant update in the past decade.
This paper first points out the problems of using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
- Score: 105.5541653811528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word and sentence embeddings are useful feature representations in natural
language processing. However, intrinsic evaluation for embeddings lags far
behind, with no significant update in the past decade. Word and sentence
similarity tasks have become the de facto evaluation method, which leads
models to overfit to such evaluations and negatively impacts the development
of embedding models. This paper first points out the problems of using
semantic similarity as the gold standard for word and sentence embedding
evaluations.
Further, we propose a new intrinsic evaluation method called EvalRank, which
shows a much stronger correlation with downstream tasks. Extensive experiments
are conducted based on 60+ models and popular datasets to validate our
judgments. Finally, a practical evaluation toolkit is released for future
benchmarking purposes.
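The core idea of a ranking-based intrinsic evaluation can be sketched in a few lines: instead of correlating similarity scores with human judgments, check whether each word's known positive pair ranks near the top among all candidates. The snippet below is a minimal illustration of this general technique, not the exact EvalRank implementation from the paper; the toy embeddings and the `hits_at_k` helper are assumptions for the example.

```python
# Minimal sketch of a ranking-based embedding evaluation
# (illustrative only; not the paper's exact EvalRank procedure).
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def hits_at_k(embeddings, positive_pairs, k=3):
    """For each (query, positive) pair, rank all other vectors by cosine
    similarity to the query and check whether the positive lands in the
    top-k. Returns the fraction of pairs for which it does."""
    hits = 0
    words = list(embeddings)
    for query, positive in positive_pairs:
        scores = {w: cosine(embeddings[query], embeddings[w])
                  for w in words if w != query}
        ranked = sorted(scores, key=scores.get, reverse=True)
        if positive in ranked[:k]:
            hits += 1
    return hits / len(positive_pairs)

# Toy 2-D embeddings: "cat"/"kitten" are close, "car"/"truck" are close.
emb = {
    "cat":    np.array([1.0, 0.1]),
    "kitten": np.array([0.9, 0.2]),
    "car":    np.array([-0.8, 1.0]),
    "truck":  np.array([-0.7, 0.9]),
}
pairs = [("cat", "kitten"), ("car", "truck")]
print(hits_at_k(emb, pairs, k=1))  # -> 1.0 (each positive ranks first)
```

A metric of this shape rewards embeddings that place semantically related items among each other's nearest neighbors, which is closer to how embeddings are consumed in downstream retrieval and classification tasks than an absolute similarity score.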
Related papers
- Holistic Evaluation for Interleaved Text-and-Image Generation [19.041251355695973]
We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation.
In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
arXiv Detail & Related papers (2024-06-20T18:07:19Z)
- Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [55.66090768926881]
We study the correspondence between decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects.
We compare three decontextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation.
We found no correspondence between trick tests and RUTEd evaluations.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation? [8.51696622847778]
A distinction is often drawn between a model's ability to predict a label for an evaluation sample by memorising highly similar training samples and its ability to do so through generalisation.
We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers.
arXiv Detail & Related papers (2023-11-21T04:06:08Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification [22.265865542786084]
We propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks.
Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models.
arXiv Detail & Related papers (2020-10-23T14:11:04Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent language model, BART, with reference summaries from the CNN/DM benchmark dataset.
Interestingly, the model-generated summaries receive higher scores than the reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.