Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment
- URL: http://arxiv.org/abs/2601.08849v1
- Date: Mon, 22 Dec 2025 17:39:13 GMT
- Title: Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment
- Authors: Manas Khatore, Sumana Sridharan, Kevork Sulahian, Benjamin J. Smith, Shi Feng
- Abstract summary: Automated answer matching shows substantial promise as a scalable and aligned alternative to human evaluation. We investigate whether tactics such as prompting examinee models to generate verbose responses can deceive answer matching models. Our results show that these manipulations do not increase scores and often reduce them.
- Score: 6.104512852467398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive "correct" or "incorrect") is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.
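To make the evaluation setup concrete, here is a minimal sketch of reference-based answer matching with the two scoring modes the abstract contrasts. The prompt templates and the `call_llm` helper are illustrative assumptions, not the paper's actual matcher prompts or models.

```python
# Minimal sketch of LLM-based answer matching: a matcher model compares a free-text
# response against a reference answer. Binary scoring forces a correct/incorrect
# verdict; continuous scoring allows partial credit. Illustrative only.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call; returns the model's reply."""
    raise NotImplementedError("wire this up to your LLM provider")

BINARY_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Does the candidate response convey the same answer as the reference? "
    "Reply with exactly one word: correct or incorrect."
)

CONTINUOUS_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Rate how correct the candidate response is on a scale from 0.0 to 1.0, "
    "where 1.0 means fully equivalent to the reference. Reply with only the number."
)

def binary_match(question: str, reference: str, response: str) -> float:
    """Binary scoring: the matcher must commit to a hard verdict (1.0 or 0.0)."""
    verdict = call_llm(BINARY_TEMPLATE.format(
        question=question, reference=reference, response=response)).strip().lower()
    return 1.0 if verdict.startswith("correct") else 0.0

def continuous_match(question: str, reference: str, response: str) -> float:
    """Continuous scoring: the matcher assigns partial credit in [0, 1]."""
    raw = call_llm(CONTINUOUS_TEMPLATE.format(
        question=question, reference=reference, response=response)).strip()
    try:
        return min(max(float(raw), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparsable judgment treated as no credit
```

The attacks studied in the paper target the examinee side (verbose answers, multiple answers when unconfident, and conflicting answers embedded alongside the correct answer near the start of the response); the reported finding is that the binary variant, which cannot award partial credit, is the more robust of the two scoring modes.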
Related papers
- Reasoning About Intent for Ambiguous Requests [47.979705857002415]
We propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision.
arXiv Detail & Related papers (2025-11-13T16:18:45Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting (a simplified illustration of this idea appears after this list).
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- Answer Matching Outperforms Multiple Choice for Language Model Evaluation [35.90520208701438]
We show multiple choice questions from popular benchmarks can often be answered without even seeing the question. We consider generative evaluation via what we call answer matching.
arXiv Detail & Related papers (2025-07-03T17:59:02Z)
- ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. We propose a guided hallucination-based approach ELOQ to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems [71.33737787564966]
End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into the so-called 'likelihood trap'.
We propose a reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system.
Our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2 ROUGE, and 2.8 METEOR scores, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-11-07T15:59:49Z)
- TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack [93.50174324435321]
We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models.
TASA produces fluent and grammatical adversarial contexts while maintaining gold answers.
arXiv Detail & Related papers (2022-10-27T07:16:30Z)
- A Systematic Evaluation of Response Selection for Open Domain Dialogue [36.88551817451512]
We curated a dataset where responses from multiple response generators produced for the same dialog context are manually annotated as appropriate (positive) or inappropriate (negative).
We conduct a systematic evaluation of state-of-the-art methods for response selection, and demonstrate that using multiple positive candidates and using manually verified hard negative candidates both bring significant performance improvements compared to using adversarial training data, e.g., increases of 3% and 13% in Recall@1, respectively.
arXiv Detail & Related papers (2022-08-08T19:33:30Z)
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
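As a simplified illustration of the "latent" scoring idea mentioned in the Reference-Free Rating entry above: rather than parsing generated text, a judge's verdict can be read off its internal signals, for example the probability mass it places on a "yes" versus "no" verdict token. The sketch below is an assumption for illustration only, not the paper's actual Latent Judges method.

```python
# Illustrative only: derive a scalar rating from a judge model's token
# log-probabilities instead of its generated text (not the Latent Judges method).
import math

def verdict_probability(logprob_yes: float, logprob_no: float) -> float:
    """Given the judge's log-probabilities for hypothetical 'yes'/'no' verdict tokens,
    return P(yes) renormalised over the two options as a rating in [0, 1]."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

# Example: log-probabilities of -0.1 ('yes') and -2.5 ('no') give a rating of ~0.92.
print(round(verdict_probability(-0.1, -2.5), 2))
```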
This list is automatically generated from the titles and abstracts of the papers on this site.