Related papers: Testing Question Answering Software with Context-Driven Question Generation

URL: http://arxiv.org/abs/2511.07924v1
Date: Wed, 12 Nov 2025 01:28:50 GMT
Title: Testing Question Answering Software with Context-Driven Question Generation
Authors: Shuang Liu, Zhirun Zhang, Jinhao Dong, Zan Wang, Qingchao Shen, Junjie Chen, Wei Lu, Xiaoyong Du
Abstract summary: We introduce CQ^2A, a context-driven question generation approach for testing question-answering systems. CQ^2A extracts entities and relationships from the context to form ground truth answers. CQ^2A outperforms state-of-the-art approaches on the bug detection capability, the naturalness of the generated questions as well as the coverage of the context.
Score: 19.83376005515088
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Question-answering software is becoming increasingly integrated into our daily lives, with prominent examples including Apple Siri and Amazon Alexa. Ensuring the quality of such systems is critical, as incorrect answers could lead to significant harm. Current state-of-the-art testing approaches apply metamorphic relations to existing test datasets, generating test questions based on these relations. However, these methods have two key limitations. First, they often produce unnatural questions that humans are unlikely to ask, reducing the effectiveness of the generated questions in identifying bugs that might occur in real-world scenarios. Second, these questions are generated from pre-existing test datasets, ignoring the broader context and thus limiting the diversity and relevance of the generated questions. In this work, we introduce CQ^2A, a context-driven question generation approach for testing question-answering systems. Specifically, CQ^2A extracts entities and relationships from the context to form ground truth answers, and utilizes large language models to generate questions based on these ground truth answers and the surrounding context. We also propose consistency verification and constraint checking to increase the reliability of the LLM's outputs. Experiments conducted on three datasets demonstrate that CQ^2A outperforms state-of-the-art approaches on the bug detection capability, the naturalness of the generated questions as well as the coverage of the context. Moreover, the test cases generated by CQ^2A reduce the error rate when utilized for fine-tuning the QA software under test.
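Illustrative sketch (not from the paper): the loop described in the abstract can be pictured roughly as below. This is a minimal Python sketch under strong assumptions, not the authors' implementation. The (subject, relation, object) triples are written by hand instead of being extracted, a fixed template stands in for the LLM question generator, and the names make_question, passes_constraints, find_failures and always_paris are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: entity/relation extraction has already produced triples.
Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class TestCase:
    question: str
    expected: str  # ground-truth answer taken from the context


def make_question(triple: Triple) -> TestCase:
    """Template stand-in for the LLM-based question generator."""
    subject, relation, obj = triple
    return TestCase(question=f"What is the {relation} of {subject}?", expected=obj)


def passes_constraints(case: TestCase, context: str) -> bool:
    """Hypothetical check: the answer is in the context and not leaked in the question."""
    return case.expected in context and case.expected not in case.question


def find_failures(
    triples: List[Triple],
    context: str,
    qa_system: Callable[[str, str], str],
) -> List[TestCase]:
    """Return the test cases whose answer from the QA system differs from the ground truth."""
    failures = []
    for triple in triples:
        case = make_question(triple)
        if not passes_constraints(case, context):
            continue  # discard unreliable test cases instead of reporting them
        answer = qa_system(context, case.question)
        if answer.strip().lower() != case.expected.lower():
            failures.append(case)
    return failures


# Toy usage with a deliberately faulty QA system that always answers "Paris".
context = "Paris is the capital of France. Berlin is the capital of Germany."
triples = [("France", "capital", "Paris"), ("Germany", "capital", "Berlin")]


def always_paris(ctx: str, question: str) -> str:
    return "Paris"


failures = find_failures(triples, context, always_paris)
for case in failures:
    print(case.question, "expected:", case.expected)
```

In this toy run only the question about Germany is reported, because the faulty system happens to answer the France question correctly. Exact string matching is a simplification chosen for the sketch; the abstract does not say how answers are compared.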

Related papers

Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z)
Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets [7.52684798377727]
We introduce Syn-(QA)$^2$, a set of two synthetically generated question-answering (QA) datasets. We find that false assumptions in QA are challenging, echoing the findings of prior work. The detection task is more challenging with long-tail questions compared to naturally occurring questions.
arXiv Detail & Related papers (2024-03-18T18:01:26Z)
Alexpaca: Learning Factual Clarification Question Generation Without Examples [19.663171923249283]
We present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. Humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics.
arXiv Detail & Related papers (2023-10-17T20:40:59Z)
QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball). QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples. We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models.
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
AGent: A Novel Pipeline for Automatically Creating Unanswerable Questions [10.272000561545331]
We propose AGent, a novel pipeline that creates new unanswerable questions by re-matching a question with a context that lacks the necessary information for a correct answer. In this paper, we demonstrate the usefulness of this AGent pipeline by creating two sets of unanswerable questions from answerable questions in SQuAD and HotpotQA.
arXiv Detail & Related papers (2023-09-10T18:13:11Z)
An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context. In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning. Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
How to Build Robust FAQ Chatbot with Controllable Question Generator? [5.680871239968297]
We propose a high-quality, diverse, controllable method to generate adversarial samples with a semantic graph. The fluent and semantically generated QA pairs fool our passage retrieval model successfully. We find that the generated data set improves the generalizability of the QA model to the new target domain.
arXiv Detail & Related papers (2021-11-18T12:54:07Z)
Improving Unsupervised Question Answering via Summarization-Informed Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a <passage, answer> pair. We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling. The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
arXiv Detail & Related papers (2021-09-16T13:08:43Z)
Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space [94.8320535537798]
We propose Controllable Rewriting based Question Data Augmentation (CRQDA) for machine reading comprehension (MRC), question generation, and question-answering natural language inference tasks. We treat the question data augmentation task as a constrained question rewriting problem to generate context-relevant, high-quality, and diverse question data samples.
arXiv Detail & Related papers (2020-10-04T03:13:46Z)
Do not let the history haunt you -- Mitigating Compounding Errors in Conversational Question Answering [17.36904526340775]
We find that compounding errors occur when using previously predicted answers at test time. We propose a sampling strategy that dynamically selects between target answers and model predictions during training.
arXiv Detail & Related papers (2020-05-12T13:29:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.