Semantic Answer Similarity for Evaluating Question Answering Models
- URL: http://arxiv.org/abs/2108.06130v1
- Date: Fri, 13 Aug 2021 09:12:27 GMT
- Title: Semantic Answer Similarity for Evaluating Question Answering Models
- Authors: Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch
- Abstract summary: SAS is a cross-encoder-based metric for the estimation of semantic answer similarity.
We show that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics.
- Score: 2.279676596857721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of question answering models compares ground-truth annotations
with model predictions. However, as of today, this comparison is mostly
lexical and therefore misses answers that have no lexical overlap with the
ground truth but are still semantically similar, thus treating correct answers
as false.
This underestimation of the true performance of models hinders user acceptance
in applications and complicates a fair comparison of different models.
Therefore, there is a need for an evaluation metric that is based on semantics
instead of pure string similarity. In this short paper, we present SAS, a
cross-encoder-based metric for the estimation of semantic answer similarity,
and compare it to seven existing metrics. To this end, we create an English and
a German three-way annotated evaluation dataset containing pairs of answers
along with human judgment of their semantic similarity, which we release along
with an implementation of the SAS metric and the experiments. We find that
semantic similarity metrics based on recent transformer models correlate much
better with human judgment than traditional lexical similarity metrics on our
two newly created datasets and one dataset from related work.
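To make the core idea concrete, here is a minimal sketch (not the released SAS implementation) that contrasts a SQuAD-style lexical F1 with the score of a publicly available STS cross-encoder; the checkpoint name and the toy answer pairs are illustrative assumptions.

```python
# Minimal sketch, not the released SAS implementation: it contrasts a
# SQuAD-style lexical token F1 with a cross-encoder semantic similarity score.
# The cross-encoder checkpoint below is a public STS model chosen for
# illustration; the model used for SAS may differ.
from collections import Counter
from sentence_transformers import CrossEncoder


def token_f1(prediction: str, reference: str) -> float:
    """Lexical overlap (SQuAD-style token F1) between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def semantic_similarity(pairs, model_name="cross-encoder/stsb-roberta-large"):
    """Score each (prediction, reference) pair jointly with a cross-encoder."""
    model = CrossEncoder(model_name)
    return model.predict(pairs)  # similarity scores roughly in [0, 1]


if __name__ == "__main__":
    pairs = [("Albert Einstein", "Einstein"),
             ("roughly four years", "about 4 years")]
    for (pred, ref), sas in zip(pairs, semantic_similarity(pairs)):
        print(f"{pred!r} vs {ref!r}: F1={token_f1(pred, ref):.2f}, semantic={sas:.2f}")
```

On such lexically divergent but semantically equivalent pairs, the token F1 stays low while the cross-encoder score is high, which is the gap the SAS metric is designed to close.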
Related papers
- Data Similarity is Not Enough to Explain Language Model Performance [6.364065652816667]
Similarity between pretraining data and downstream task data is commonly assumed to correlate with language model performance.
The paper finds, however, that such similarity metrics are not correlated with accuracy or even with each other.
This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
arXiv Detail & Related papers (2023-11-15T14:48:08Z)
- Semantic similarity prediction is better than other semantic similarity measures [5.176134438571082]
We argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task.
Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
arXiv Detail & Related papers (2023-09-22T08:11:01Z)
- Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects [92.80955339180119]
Mainstream crowd counting methods regress a density map and integrate it to obtain the counting result.
Inspired by this, we propose a rational and anthropoid crowd counting framework.
arXiv Detail & Related papers (2022-12-02T07:00:53Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, our proposed metric outperforms all competing metrics in terms of system-level correlation.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Evaluation of Semantic Answer Similarity Metrics [0.0]
We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures.
We provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training; a minimal bi-encoder and BERTScore sketch appears after this list.
arXiv Detail & Related papers (2022-06-25T14:40:36Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
- Identifying Ambiguous Similarity Conditions via Semantic Matching [49.06931755266372]
We introduce Weakly Supervised Conditional Similarity Learning (WS-CSL).
WS-CSL learns multiple embeddings to match semantic conditions without explicit condition labels such as "can fly".
We propose the Distance Induced Semantic COndition VERification Network (DiscoverNet), which characterizes the instance-instance and triplets-condition relations in a "decompose-and-fuse" manner.
arXiv Detail & Related papers (2022-04-08T13:15:55Z)
- 'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities [0.0]
We introduce the Wiki Entity Similarity (WES) dataset, an 11M example, domain targeted, semantic entity similarity dataset that is generated from link texts in Wikipedia.
WES is tailored to QA evaluation: the examples are entities and phrases and grouped into semantic clusters to simulate multiple ground-truth labels.
Human annotators consistently agree with WES labels, and a basic cross-encoder metric is better than four classic metrics at predicting human judgments of correctness.
arXiv Detail & Related papers (2022-02-28T07:12:39Z)
- A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning [111.05365744744437]
Unsupervised contrastive learning labels crops of the same image as positives, and other image crops as negatives.
In this work, we first prove that for contrastive learning, inaccurate label assignment heavily impairs its generalization for semantic instance discrimination.
Inspired by this theory, we propose a novel self-labeling refinement approach for contrastive learning.
arXiv Detail & Related papers (2021-06-28T14:24:52Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings; a generic bootstrap resampling sketch appears after this list.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
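For the "Evaluation of Semantic Answer Similarity Metrics" entry above, the following is a minimal sketch of the bi-encoder and BERTScore flavour of answer similarity, assuming the sentence-transformers and bert-score packages; the checkpoints are public defaults chosen for illustration, not the models trained in that paper.

```python
# Minimal sketch of bi-encoder answer similarity plus BERTScore, assuming the
# sentence-transformers and bert-score packages; the checkpoints below are
# public defaults chosen for illustration, not the paper's trained models.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bertscore


def biencoder_similarity(predictions, references,
                         model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode answers independently and compare them by cosine similarity."""
    model = SentenceTransformer(model_name)
    pred_emb = model.encode(predictions, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    # Cosine similarity between aligned prediction/reference pairs.
    return util.cos_sim(pred_emb, ref_emb).diagonal()


if __name__ == "__main__":
    preds = ["Barack Obama", "NYC"]
    refs = ["Obama", "New York City"]
    print("bi-encoder:", biencoder_similarity(preds, refs))
    # BERTScore F1 per pair (precision and recall are also returned).
    P, R, F1 = bertscore(preds, refs, lang="en")
    print("BERTScore F1:", F1)
```

Unlike a cross-encoder, the bi-encoder scores each answer with a separate forward pass, which trades some accuracy for the ability to precompute and cache embeddings.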
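For "A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods", the sketch below illustrates the general bootstrap idea only (not that paper's exact protocol): resample (metric score, human judgment) pairs and report a percentile confidence interval for their correlation.

```python
# Generic bootstrap sketch for the reliability of an automatic metric:
# resample (metric score, human judgment) pairs with replacement and report a
# percentile confidence interval for their Pearson correlation. This shows the
# resampling idea only; it is not the cited paper's exact protocol.
import numpy as np


def bootstrap_correlation_ci(metric_scores, human_scores,
                             n_resamples=10_000, alpha=0.05, seed=0):
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(metric_scores)
    corrs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample pair indices with replacement
        corrs.append(np.corrcoef(metric_scores[idx], human_scores[idx])[0, 1])
    lower, upper = np.quantile(corrs, [alpha / 2, 1 - alpha / 2])
    return lower, upper


if __name__ == "__main__":
    # Toy data: metric scores and human similarity judgments for 10 answer pairs.
    metric = [0.9, 0.2, 0.7, 0.4, 0.95, 0.1, 0.6, 0.8, 0.3, 0.55]
    human = [0.9, 0.1, 0.8, 0.4, 1.0, 0.0, 0.5, 0.7, 0.2, 0.6]
    print("95% CI for Pearson r:", bootstrap_correlation_ci(metric, human))
```

A wide interval from such a procedure is exactly the kind of evidence the cited paper uses to argue that metric reliability is often overstated.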