DISTO: Evaluating Textual Distractors for Multi-Choice Questions using
Negative Sampling based Approach
- URL: http://arxiv.org/abs/2304.04881v1
- Date: Mon, 10 Apr 2023 22:03:00 GMT
- Title: DISTO: Evaluating Textual Distractors for Multi-Choice Questions using
Negative Sampling based Approach
- Authors: Bilal Ghanem and Alona Fyshe
- Abstract summary: Multiple choice questions (MCQs) are an efficient and common way to assess reading comprehension (RC).
Distractor generation (DG) models have been proposed, and their performance is typically evaluated using machine translation (MT) metrics.
We propose DISTO: the first learned evaluation metric for generated distractors.
- Score: 5.033269502052902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple choice questions (MCQs) are an efficient and common way to assess
reading comprehension (RC). Every MCQ needs a set of distractor answers that
are incorrect, but plausible enough to test student knowledge. Distractor
generation (DG) models have been proposed, and their performance is typically
evaluated using machine translation (MT) metrics. However, MT metrics often
misjudge the suitability of generated distractors. We propose DISTO: the first
learned evaluation metric for generated distractors. We validate DISTO by
showing its scores correlate highly with human ratings of distractor quality.
At the same time, DISTO ranks the performance of state-of-the-art DG models
very differently from MT-based metrics, showing that MT metrics should not be
used for distractor evaluation.
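The abstract describes what DISTO does (a learned metric validated by correlation with human ratings) and the title names the training strategy (negative sampling), but gives no implementation details. The Python sketch below illustrates both pieces under invented data: `mcq_items`, `build_training_pairs`, and the placeholder score and rating arrays are assumptions for illustration, not the authors' code or data.

```python
# Hedged sketch (not the authors' released code): (1) building training pairs
# for a learned distractor metric via negative sampling, as the paper title
# suggests, and (2) validating a metric by correlating its scores with human
# ratings, as the abstract describes. All data here is invented.
import random
from scipy.stats import pearsonr

# Toy MCQ items: (question, correct answer, gold distractors).
mcq_items = [
    ("Which gas do plants absorb for photosynthesis?", "carbon dioxide",
     ["oxygen", "nitrogen"]),
    ("Which planet is the largest in the solar system?", "Jupiter",
     ["Saturn", "Neptune"]),
]

def build_training_pairs(items, num_negatives=1, seed=0):
    """Positives: a question paired with its own gold distractors (label 1).
    Negatives: the same question paired with options sampled from *other*
    questions (label 0) -- the negative-sampling step."""
    rng = random.Random(seed)
    pairs = []
    for i, (question, _answer, distractors) in enumerate(items):
        for d in distractors:
            pairs.append((question, d, 1))
        others = [j for j in range(len(items)) if j != i]
        for _ in range(num_negatives):
            _, other_answer, other_distractors = items[rng.choice(others)]
            pairs.append((question, rng.choice([other_answer] + other_distractors), 0))
    return pairs

print(build_training_pairs(mcq_items))

# Validation as in the abstract: correlate the learned metric's scores with
# human ratings of distractor quality; the numbers below are placeholders.
metric_scores = [0.92, 0.81, 0.15, 0.74, 0.20, 0.05]
human_ratings = [5, 4, 1, 4, 2, 1]
r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

A full implementation would replace the placeholder scores with the output of a trained scorer (for example, a fine-tuned transformer encoder over question-distractor pairs); the specific model and correlation statistic used by DISTO are not stated in this abstract.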
Related papers
- Metric assessment protocol in the context of answer fluctuation on MCQ tasks [4.453107218424601]
Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently.
Previous research has not conducted a thorough assessment of them.
We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates.
arXiv Detail & Related papers (2025-07-21T13:01:46Z)
- AskQE: Question Answering as Automatic Evaluation for Machine Translation [24.088731832956373]
We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback.
We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall's Tau correlation and decision accuracy with human ratings compared to other QE metrics.
arXiv Detail & Related papers (2025-04-15T19:57:42Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA).
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation [24.481028155002523]
Metric interference (MINT) causes model tuning and evaluation problems.
MINT can misguide practitioners into being overoptimistic about the performance of their systems.
We propose MINTADJUST, a method for more reliable evaluation under MINT.
arXiv Detail & Related papers (2025-03-11T11:40:10Z)
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that most MT systems fail to predict correctly is treated as difficult and assigned a larger weight in the final score function (a toy weighting sketch of this idea appears after this list).
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
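The Difficulty-Aware Machine Translation Evaluation entry above describes weighting each segment by how difficult it is for MT systems, but this listing does not give the exact formula. Below is a minimal toy sketch, assuming difficulty is one minus the mean cross-system score; `difficulty_weights`, `weighted_system_score`, and the segment scores are hypothetical, not taken from the paper.

```python
# Hedged sketch of a difficulty-aware weighting scheme; the exact difficulty
# and weighting functions used by the paper are assumptions here.
def difficulty_weights(per_segment_scores):
    """per_segment_scores[i][j]: score of system j on segment i (0..1 scale).
    A segment most systems score poorly on is 'difficult' and gets a larger
    weight; here difficulty = 1 - mean score across systems."""
    return [1.0 - sum(row) / len(row) for row in per_segment_scores]

def weighted_system_score(system_scores, weights):
    """Difficulty-weighted average of one system's segment scores."""
    return sum(s * w for s, w in zip(system_scores, weights)) / sum(weights)

# Toy example: 3 segments scored for 3 MT systems (e.g., by a base metric).
per_segment = [
    [0.9, 0.9, 0.8],   # easy segment: all systems do well -> small weight
    [0.4, 0.3, 0.7],   # harder segment -> larger weight
    [0.1, 0.2, 0.6],   # hardest segment -> largest weight
]
weights = difficulty_weights(per_segment)
for j in range(3):
    score = weighted_system_score([row[j] for row in per_segment], weights)
    print(f"system {j}: difficulty-weighted score = {score:.3f}")
```

In this toy setup, systems that do well only on segments every system handles gain little, while doing well on hard segments is rewarded, which is the intuition the entry describes.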
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.