Evaluation of Semantic Answer Similarity Metrics
- URL: http://arxiv.org/abs/2206.12664v1
- Date: Sat, 25 Jun 2022 14:40:36 GMT
- Title: Evaluation of Semantic Answer Similarity Metrics
- Authors: Farida Mustafazade, Peter Ebbinghaus
- Abstract summary: We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures.
We provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are several issues with the existing general machine translation or
natural language generation evaluation metrics, and question-answering (QA)
systems are no exception in that context. To build robust QA systems, we need
equally robust evaluation systems to verify whether model predictions to
questions are similar to ground-truth annotations. The ability to compare
similarity based on semantics, as opposed to pure string overlap, is important
for comparing models fairly and for indicating more realistic acceptance
criteria in real-life applications. We build upon what is, to our knowledge, the
first paper that uses transformer-based model metrics to assess semantic answer
similarity, and we achieve higher correlation with human judgement in cases with
no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore
models for semantic answer similarity, trained on a new dataset consisting of
name pairs of US-American public figures. To the best of our knowledge, we
provide the first dataset of co-referent name string pairs along with their
similarities, which can be used for training.
4th International Conference on Machine Learning & Applications (CMLA 2022),
June 25-26, 2022, Copenhagen, Denmark. Volume Editors: David C. Wyld,
Dhinaharan Nagamalai (Eds). ISBN: 978-1-925953-69-5
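
The contrast between lexical overlap and semantic answer similarity described in the abstract can be sketched with off-the-shelf models. The snippet below is a minimal illustration under stated assumptions, not the authors' trained models: the checkpoints (all-MiniLM-L6-v2 as bi-encoder, stsb-roberta-large as cross-encoder) and the bert-score package are generic public components chosen for the example, whereas the paper trains its own cross-encoder augmented bi-encoder and BERTScore variants on its name-pair dataset.

```python
# Minimal sketch: lexical exact match vs. semantic answer similarity.
# Assumes the sentence-transformers and bert-score packages; the checkpoints
# below are generic public models, not the ones trained in the paper.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from bert_score import score as bert_score

# Co-referent names of a US public figure with no lexical overlap.
prediction, gold = "JFK", "John Fitzgerald Kennedy"

# 1) Lexical metric: exact match fails even though the answers co-refer.
exact_match = int(prediction.strip().lower() == gold.strip().lower())

# 2) Bi-encoder: embed each answer independently, compare with cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = bi_encoder.encode([prediction, gold], convert_to_tensor=True)
bi_sim = util.cos_sim(emb[0], emb[1]).item()

# 3) Cross-encoder: score the answer pair jointly (STS heads output roughly 0..1).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
cross_sim = float(cross_encoder.predict([(prediction, gold)])[0])

# 4) BERTScore: token-level soft overlap between prediction and gold answer.
_, _, f1 = bert_score([prediction], [gold], lang="en", rescale_with_baseline=True)

print(f"EM={exact_match}  bi-encoder={bi_sim:.2f}  "
      f"cross-encoder={cross_sim:.2f}  BERTScore-F1={f1.item():.2f}")
```

The exact-match score is 0 by construction, while the embedding-based scores reward the semantic equivalence of the two name strings, which is the behaviour the paper's evaluation targets.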
Related papers
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References (arXiv, 2023-09-21)
  We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
  We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
- Scaling Laws Do Not Scale (arXiv, 2023-07-05)
  Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
  We argue that this scaling-law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
  Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about the metrics used for model evaluations.
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale (arXiv, 2023-04-24)
  PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
  We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
  We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
- A Study on the Evaluation of Generative Models (arXiv, 2022-06-22)
  Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
  In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
  Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably.
- 'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities (arXiv, 2022-02-28)
  We introduce the Wiki Entity Similarity (WES) dataset, an 11M-example, domain-targeted, semantic entity similarity dataset generated from link texts in Wikipedia.
  WES is tailored to QA evaluation: the examples are entities and phrases grouped into semantic clusters to simulate multiple ground-truth labels.
  Human annotators consistently agree with WES labels, and a basic cross-encoder metric is better than four classic metrics at predicting human judgments of correctness.
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses (arXiv, 2021-09-24)
  We investigate the reason behind the surprising adversarial brittleness of scoring models.
  Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
  We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
- Semantic Answer Similarity for Evaluating Question Answering Models (arXiv, 2021-08-13)
  SAS is a cross-encoder-based metric for the estimation of semantic answer similarity.
  We show that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics (a minimal correlation sketch follows this list).
- Learning with Instance Bundles for Reading Comprehension (arXiv, 2021-04-18)
  We introduce new supervision techniques that compare question-answer scores across multiple related instances.
  Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
  We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
- $Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering (arXiv, 2021-04-16)
  We propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models.
  Our metric makes use of co-reference resolution and natural language inference capabilities.
  We curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset.
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights (arXiv, 2020-05-01)
  KPQA is a new metric for evaluating the correctness of generative question answering systems.
  Our new metric assigns different weights to each token via keyphrase prediction.
  We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
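
As noted in the SAS entry above, both the present paper and SAS validate a metric by correlating its scores with human judgment. The sketch below shows only that evaluation step; the numbers are invented placeholders, and real evaluations use metric outputs and annotator labels collected over prediction/gold answer pairs.

```python
# Toy sketch of correlating automatic metric scores with human judgments.
# All values below are made up for illustration.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.92, 0.15, 0.78, 0.05, 0.64]  # e.g. cross-encoder similarities
human_labels = [1.0, 0.0, 1.0, 0.0, 0.5]        # e.g. graded human similarity ratings

pearson_r, _ = pearsonr(metric_scores, human_labels)      # linear agreement
spearman_rho, _ = spearmanr(metric_scores, human_labels)  # rank agreement
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```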