Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
- URL: http://arxiv.org/abs/2508.10192v1
- Date: Wed, 13 Aug 2025 20:55:26 GMT
- Title: Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
- Authors: Igor Halperin
- Abstract summary: This paper introduces Semantic Divergence Metrics (SDM), a novel framework for detecting Faithfulness Hallucinations. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviation of LLM responses from input contexts. We focus on a specific instance of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer $||$ Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
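The abstract names the ingredients of the score but not their exact combination. A minimal sketch, assuming prompt and answer topic distributions have already been obtained from joint clustering, and assuming an unweighted sum of the two terms (the paper's actual $\mathcal{S}_H$ weighting is not given here):

```python
# Hedged sketch: combine Jensen-Shannon divergence and a 1-D Wasserstein
# distance over topic distributions, plus the KL(Answer || Prompt) indicator
# described in the abstract. Function names and the unweighted sum are
# illustrative assumptions, not the paper's exact formulation.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def semantic_divergence_score(p_prompt, p_answer):
    p = np.asarray(p_prompt, dtype=float)
    q = np.asarray(p_answer, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    js = jensenshannon(p, q) ** 2            # squared JS distance = JS divergence (nats)
    topics = np.arange(len(p))               # topic indices as 1-D support points
    w = wasserstein_distance(topics, topics, p, q)
    return js + w                            # assumed combination; S_H may weight terms

def kl_answer_prompt(p_prompt, p_answer, eps=1e-12):
    # KL(Answer || Prompt): flagged in the abstract as a "Semantic Exploration" signal.
    p = np.asarray(p_prompt, dtype=float) + eps
    q = np.asarray(p_answer, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(q * np.log(q / p)))
```

Identical topic distributions yield a score near zero; disjoint topic mass drives both terms up, which is the behavior the abstract associates with a Faithfulness hallucination.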
Related papers
- Feedback Indices to Evaluate LLM Responses to Rebuttals for Multiple Choice Type Questions [0.0]
We present a framework designed to characterize Large Language Model (LLM) responses when challenged with rebuttals during a chat. Our approach employs a fictitious-response rebuttal method that quantifies LLM behavior when presented with multiple-choice questions. The indices are specifically designed to detect and measure what could be characterized as sycophantic behavior.
arXiv Detail & Related papers (2026-01-02T21:16:43Z) - ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models [32.099137908375546]
ClarifyMT-Bench is a benchmark for multi-turn clarification in large language models (LLMs). We construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. We propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning.
arXiv Detail & Related papers (2025-12-24T11:39:00Z) - Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task to an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z) - Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation [25.48843962368704]
We introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere. Given sampled responses, we estimate their level of semantic isotropy as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater embedding dispersion -- reliably signals lower factual consistency across samples.
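The summary above describes semantic isotropy as the angular dispersion of normalized embeddings on the unit sphere. A minimal sketch, assuming mean pairwise angle as the dispersion estimator (the paper's exact estimator is not stated in this summary):

```python
# Hedged sketch: estimate "semantic isotropy" as the mean pairwise angle
# between L2-normalized response embeddings on the unit sphere. The choice
# of mean pairwise angle is an assumption for illustration.
import numpy as np

def semantic_isotropy(embeddings):
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit sphere
    cos = np.clip(X @ X.T, -1.0, 1.0)                  # pairwise cosine similarities
    iu = np.triu_indices(len(X), k=1)                  # unique pairs only
    return float(np.mean(np.arccos(cos[iu])))          # mean pairwise angle, radians
```

Tightly clustered embeddings give a small angle (low isotropy, high factual consistency per the summary), while dispersed embeddings give a large one.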
arXiv Detail & Related papers (2025-10-24T03:24:57Z) - Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models [49.435669307386156]
Multi-stage Prompt Refinement (MPR) is a framework designed to systematically improve ill-formed prompts across multiple stages. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Results on hallucination benchmarks show that MPR-refined prompts achieve over an 85% win rate compared to their original forms.
arXiv Detail & Related papers (2025-10-14T00:31:36Z) - Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck [0.0]
We develop a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound.
arXiv Detail & Related papers (2025-08-26T20:00:51Z) - Semantic Energy: Detecting LLM Hallucination Beyond Entropy [106.92072182161712]
Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations. Uncertainty estimation is a feasible approach to detect such hallucinations. We introduce Semantic Energy, a novel uncertainty estimation framework.
arXiv Detail & Related papers (2025-08-20T07:33:50Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Clarifying Ambiguities: on the Role of Ambiguity Types in Prompting Methods for Clarification Generation [5.259846811078731]
We focus on the concept of ambiguity for clarification, seeking to model and integrate ambiguities in the clarification process. We name this new prompting scheme Ambiguity Type-Chain of Thought (AT-CoT).
arXiv Detail & Related papers (2025-04-16T14:21:02Z) - DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection [45.56930979807214]
Class-agnostic object detection can be a cornerstone or a bottleneck for many downstream vision tasks. We investigate using vision-language models to enhance object detection via a self-supervised prompt learning strategy. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS.
arXiv Detail & Related papers (2024-06-21T07:33:37Z) - QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering [32.436530949623155]
We propose a unique scaling strategy to capture latent semantic center features of queries.
These features are seamlessly integrated into traditional query and passage embeddings.
Our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers.
arXiv Detail & Related papers (2024-04-30T07:34:42Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Elastic Weight Removal for Faithful and Abstractive Dialogue Generation [61.40951756070646]
A dialogue system should generate responses that are faithful to the knowledge contained in relevant documents.
Many models generate hallucinated responses instead that contradict it or contain unverifiable information.
We show that our method can be extended to simultaneously discourage hallucinations and extractive responses.
arXiv Detail & Related papers (2023-03-30T17:40:30Z) - Pareto Probing: Trading Off Accuracy for Complexity [87.09294772742737]
We argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance.
Our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
arXiv Detail & Related papers (2020-10-05T17:27:31Z) - Generating Dialogue Responses from a Semantic Latent Space [75.18449428414736]
We propose an alternative to end-to-end classification over the vocabulary.
We learn the pair relationship between the prompts and responses as a regression task on a latent space.
Human evaluation showed that learning the task on a continuous space can generate responses that are both relevant and informative.
arXiv Detail & Related papers (2020-10-04T19:06:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.