Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering
- URL: http://arxiv.org/abs/2508.12355v1
- Date: Sun, 17 Aug 2025 12:58:48 GMT
- Title: Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering
- Authors: Eviatar Nachshoni, Arie Cattan, Shmuel Amar, Ori Shapira, Ido Dagan
- Abstract summary: Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. We introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA. We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts.
- Score: 22.447638522275092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated strong performance in question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across pieces of evidence, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers, but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA, a new benchmark for realistic, conflict-aware MAQA, enriched with detailed conflict labels for all answer pairs. We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts and the flawed strategies they employ to resolve them.
Related papers
- Inferential Question Answering [67.54465021408724]
We introduce Inferential QA, a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Who's Who: Large Language Models Meet Knowledge Conflicts in Practice [28.48156432356721]
We introduce WhoQA, a benchmark dataset to examine models' behavior in knowledge conflict situations.
We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers.
Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.
arXiv Detail & Related papers (2024-10-21T07:56:45Z)
- PokeMQA: Programmable knowledge editing for Multi-hop Question Answering [46.80110170981976]
Multi-hop question answering (MQA) is one of the challenging tasks used to evaluate a machine's comprehension and reasoning abilities.
We propose a framework, Programmable knowledge editing for Multi-hop Question Answering (PokeMQA).
Specifically, we prompt LLMs to decompose knowledge-augmented multi-hop questions, while interacting with a detached trainable scope detector to modulate LLMs' behavior depending on an external conflict signal.
arXiv Detail & Related papers (2023-12-23T08:32:13Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging.
Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- How to Build Robust FAQ Chatbot with Controllable Question Generator? [5.680871239968297]
We propose a high-quality, diverse, controllable method to generate adversarial samples with a semantic graph.
The fluent and semantically generated QA pairs fool our passage retrieval model successfully.
We find that the generated data set improves the generalizability of the QA model to the new target domain.
arXiv Detail & Related papers (2021-11-18T12:54:07Z)
- Logically Consistent Loss for Visual Question Answering [66.83963844316561]
Current neural-network-based Visual Question Answering (VQA) models cannot ensure logical consistency because of the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in the multi-task learning framework.
Experiments confirm that the proposed loss formulae and the introduction of hybrid-batch lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z)
- Do not let the history haunt you -- Mitigating Compounding Errors in Conversational Question Answering [17.36904526340775]
We find that compounding errors occur when using previously predicted answers at test time.
We propose a sampling strategy that dynamically selects between target answers and model predictions during training.
arXiv Detail & Related papers (2020-05-12T13:29:38Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)