DISTO: Evaluating Textual Distractors for Multi-Choice Questions using
Negative Sampling based Approach
- URL: http://arxiv.org/abs/2304.04881v1
- Date: Mon, 10 Apr 2023 22:03:00 GMT
- Title: DISTO: Evaluating Textual Distractors for Multi-Choice Questions using
Negative Sampling based Approach
- Authors: Bilal Ghanem and Alona Fyshe
- Abstract summary: Multiple choice questions (MCQs) are an efficient and common way to assess reading comprehension (RC).
Distractor generation (DG) models have been proposed, and their performance is typically evaluated using machine translation (MT) metrics.
We propose DISTO: the first learned evaluation metric for generated distractors.
- Score: 5.033269502052902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple choice questions (MCQs) are an efficient and common way to assess
reading comprehension (RC). Every MCQ needs a set of distractor answers that
are incorrect, but plausible enough to test student knowledge. Distractor
generation (DG) models have been proposed, and their performance is typically
evaluated using machine translation (MT) metrics. However, MT metrics often
misjudge the suitability of generated distractors. We propose DISTO: the first
learned evaluation metric for generated distractors. We validate DISTO by
showing its scores correlate highly with human ratings of distractor quality.
At the same time, DISTO ranks the performance of state-of-the-art DG models
very differently from MT-based metrics, showing that MT metrics should not be
used for distractor evaluation.
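The abstract describes what DISTO does (a learned metric validated by correlation with human ratings) and the title names the training strategy (negative sampling), but gives no implementation details. The Python sketch below illustrates both pieces under invented data: `mcq_items`, `build_training_pairs`, and the placeholder score and rating arrays are assumptions for illustration, not the authors' code or data.

```python
# Hedged sketch (not the authors' released code): (1) building training pairs
# for a learned distractor metric via negative sampling, as the paper title
# suggests, and (2) validating a metric by correlating its scores with human
# ratings, as the abstract describes. All data here is invented.
import random
from scipy.stats import pearsonr

# Toy MCQ items: (question, correct answer, gold distractors).
mcq_items = [
    ("Which gas do plants absorb for photosynthesis?", "carbon dioxide",
     ["oxygen", "nitrogen"]),
    ("Which planet is the largest in the solar system?", "Jupiter",
     ["Saturn", "Neptune"]),
]

def build_training_pairs(items, num_negatives=1, seed=0):
    """Positives: a question paired with its own gold distractors (label 1).
    Negatives: the same question paired with options sampled from *other*
    questions (label 0) -- the negative-sampling step."""
    rng = random.Random(seed)
    pairs = []
    for i, (question, _answer, distractors) in enumerate(items):
        for d in distractors:
            pairs.append((question, d, 1))
        others = [j for j in range(len(items)) if j != i]
        for _ in range(num_negatives):
            _, other_answer, other_distractors = items[rng.choice(others)]
            pairs.append((question, rng.choice([other_answer] + other_distractors), 0))
    return pairs

print(build_training_pairs(mcq_items))

# Validation as in the abstract: correlate the learned metric's scores with
# human ratings of distractor quality; the numbers below are placeholders.
metric_scores = [0.92, 0.81, 0.15, 0.74, 0.20, 0.05]
human_ratings = [5, 4, 1, 4, 2, 1]
r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

A full implementation would replace the placeholder scores with the output of a trained scorer (for example, a fine-tuned transformer encoder over question-distractor pairs); the specific model and correlation statistic used by DISTO are not stated in this abstract.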
Related papers
- Metric assessment protocol in the context of answer fluctuation on MCQ tasks [4.453107218424601]
Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently.
Previous research has not conducted a thorough assessment of them.
We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates.
arXiv Detail & Related papers (2025-07-21T13:01:46Z)
- AskQE: Question Answering as Automatic Evaluation for Machine Translation [24.088731832956373]
We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback.
We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall's Tau correlation and decision accuracy with human ratings compared to other QE metrics.
arXiv Detail & Related papers (2025-04-15T19:57:42Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA).
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation [24.481028155002523]
Metric interference (MINT) causes model tuning and evaluation problems.
MINT can misguide practitioners into being overoptimistic about the performance of their systems.
We propose MINTADJUST, a method for more reliable evaluation under MINT.
arXiv Detail & Related papers (2025-03-11T11:40:10Z)
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that most MT systems fail to predict correctly is treated as difficult and assigned a larger weight in the final score function (a toy weighting sketch of this idea appears after this list).
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
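The Difficulty-Aware Machine Translation Evaluation entry above describes weighting each segment by how difficult it is for MT systems, but this listing does not give the exact formula. Below is a minimal toy sketch, assuming difficulty is one minus the mean cross-system score; `difficulty_weights`, `weighted_system_score`, and the segment scores are hypothetical, not taken from the paper.

```python
# Hedged sketch of a difficulty-aware weighting scheme; the exact difficulty
# and weighting functions used by the paper are assumptions here.
def difficulty_weights(per_segment_scores):
    """per_segment_scores[i][j]: score of system j on segment i (0..1 scale).
    A segment most systems score poorly on is 'difficult' and gets a larger
    weight; here difficulty = 1 - mean score across systems."""
    return [1.0 - sum(row) / len(row) for row in per_segment_scores]

def weighted_system_score(system_scores, weights):
    """Difficulty-weighted average of one system's segment scores."""
    return sum(s * w for s, w in zip(system_scores, weights)) / sum(weights)

# Toy example: 3 segments scored for 3 MT systems (e.g., by a base metric).
per_segment = [
    [0.9, 0.9, 0.8],   # easy segment: all systems do well -> small weight
    [0.4, 0.3, 0.7],   # harder segment -> larger weight
    [0.1, 0.2, 0.6],   # hardest segment -> largest weight
]
weights = difficulty_weights(per_segment)
for j in range(3):
    score = weighted_system_score([row[j] for row in per_segment], weights)
    print(f"system {j}: difficulty-weighted score = {score:.3f}")
```

In this toy setup, systems that do well only on segments every system handles gain little, while doing well on hard segments is rewarded, which is the intuition the entry describes.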
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.