Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations
- URL: http://arxiv.org/abs/2509.10963v1
- Date: Sat, 13 Sep 2025 19:44:42 GMT
- Title: Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations
- Authors: Aranyak Acharyya, Carey E. Priebe, Hayden S. Helm
- Abstract summary: Given two input queries, it is natural to ask if their response distributions are the same. A traditional test of equality might indicate that two semantically equivalent queries induce statistically different response distributions. In this paper, we address this misalignment by incorporating into the testing procedure consideration of a collection of semantically similar queries.
- Score: 10.216191904121178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an input query, generative models such as large language models produce a random response drawn from a response distribution. Given two input queries, it is natural to ask if their response distributions are the same. While traditional statistical hypothesis testing is designed to address this question, the response distribution induced by an input query is often sensitive to semantically irrelevant perturbations to the query, so much so that a traditional test of equality might indicate that two semantically equivalent queries induce statistically different response distributions. As a result, the outcome of the statistical test may not align with the user's requirements. In this paper, we address this misalignment by incorporating into the testing procedure consideration of a collection of semantically similar queries. In our setting, the mapping from the collection of user-defined semantically similar queries to the corresponding collection of response distributions is not known a priori and must be estimated, with a fixed budget. Although the problem we address is quite general, we focus our analysis on the setting where the responses are binary, show that the proposed test is asymptotically valid and consistent, and discuss important practical considerations with respect to power and computation.
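To make the binary-response setting concrete, the sketch below illustrates one plausible reading of a composite-null test: equality is rejected only if the candidate query's estimated response probability differs significantly from every member of the user-defined collection of semantically equivalent perturbations, each estimated under a fixed per-query sampling budget. The helper names, the two-proportion z-test, and the max-p-value aggregation are illustrative assumptions, not the paper's exact construction.

```python
"""Minimal sketch (assumed, not the authors' procedure): composite-null test
for binary LLM responses. H0: the candidate query's response probability
matches that of at least one query in a collection of semantically
equivalent perturbations, so we reject only if the candidate differs
significantly from every member of the collection."""
import numpy as np
from scipy.stats import norm


def two_proportion_pvalue(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test p-value (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return 2 * norm.sf(abs(z))


def composite_null_test(candidate_successes, candidate_n,
                        collection_successes, collection_n, alpha=0.05):
    """Reject H0 only if the candidate differs from every perturbed query.

    collection_successes / collection_n hold per-query counts for the K
    semantically equivalent perturbations, gathered under a fixed budget.
    Returns (reject, p_value), where p_value is the maximum over the
    pairwise tests, so the decision is conservative under the composite null.
    """
    pvals = [
        two_proportion_pvalue(candidate_successes, candidate_n, s_k, n_k)
        for s_k, n_k in zip(collection_successes, collection_n)
    ]
    p_value = max(pvals)
    return p_value < alpha, p_value


if __name__ == "__main__":
    # Hypothetical counts: three perturbations of query A with 200 binary
    # responses each, versus a candidate query B with 200 responses.
    reject, p = composite_null_test(
        candidate_successes=150, candidate_n=200,
        collection_successes=[120, 126, 131], collection_n=[200, 200, 200],
    )
    print(f"reject H0: {reject}, p-value: {p:.4f}")
```

Because the reported p-value is the maximum over the pairwise comparisons, the procedure stays conservative whenever the candidate truly matches at least one perturbation, which mirrors the paper's goal of not flagging semantically equivalent queries as different.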
Related papers
- Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis [2.905751301655124]
We develop a principled framework based on a shared responsibility of query specification between user and system.
Applying the framework to evaluations for question answering and analysis, we analyze the queries in 15 popular datasets.
arXiv Detail & Related papers (2025-11-06T17:39:18Z)
- Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem.
We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling.
We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
arXiv Detail & Related papers (2025-06-09T17:11:07Z)
- Variability Need Not Imply Error: The Case of Adequate but Semantically Distinct Responses [7.581259361859477]
Uncertainty quantification tools can be used to reject a response when the model is 'uncertain'.
We estimate the Probability the model assigns to Adequate Responses (PROBAR).
We find PROBAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
arXiv Detail & Related papers (2024-12-20T09:02:26Z)
- Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries [85.81295563405433]
We present a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping benchmark rankings between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z)
- Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks [2.1899189033259305]
The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.
This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest.
We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
arXiv Detail & Related papers (2024-04-25T18:35:54Z)
- Mitigating LLM Hallucinations via Conformal Abstention [70.83870602967625]
We develop a principled procedure for determining when a large language model should abstain from responding in a general domain.
We leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate).
Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets.
arXiv Detail & Related papers (2024-04-04T11:32:03Z)
- Selecting Query-bag as Pseudo Relevance Feedback for Information-seeking Conversations [76.70349332096693]
Information-seeking dialogue systems are widely used in e-commerce systems.
We propose a Query-bag based Pseudo Relevance Feedback framework (QB-PRF)
It constructs a query-bag with related queries to serve as pseudo signals to guide information-seeking conversations.
arXiv Detail & Related papers (2024-03-22T08:10:32Z)
- Searching for Better Database Queries in the Outputs of Semantic Parsers [16.221439565760058]
In this paper, we consider the case when, at the test time, the system has access to an external criterion that evaluates the generated queries.
The criterion can vary from checking that a query executes without errors to verifying the query on a set of tests.
We apply our approach to state-of-the-art semantic parsers and report that it allows us to find many queries passing all the tests on different datasets.
arXiv Detail & Related papers (2022-10-13T17:20:45Z)
- Knowledge Base Question Answering by Case-based Reasoning over Subgraphs [81.22050011503933]
We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms.
The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
arXiv Detail & Related papers (2022-02-22T01:34:35Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model has better performance (4.9% higher than baseline) on the adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)