Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck
- URL: http://arxiv.org/abs/2509.03533v1
- Date: Tue, 26 Aug 2025 20:00:51 GMT
- Title: Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck
- Authors: Igor Halperin,
- Abstract summary: We develop a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering.<n>Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are prone to critical failure modes, including \textit{intrinsic faithfulness hallucinations} (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.
Related papers
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding.<n>We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLMs reasoning performance.<n>We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
arXiv Detail & Related papers (2026-01-13T03:35:18Z) - Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented.<n>It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering.<n>Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z) - Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques [0.0]
This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings.<n>By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden.
arXiv Detail & Related papers (2025-12-31T13:43:57Z) - Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features [0.0]
"context steering" is a methodology that actively guides the feature-shaping process.<n>We validate the capabilities of this strategy using Normalized Compression Distance (NCD) and Relative Compression Distance (NRC)<n> Experimental results across heterogeneous datasets-from text to real-world audio-validate the robustness and generality of context steering.
arXiv Detail & Related papers (2025-08-20T15:26:52Z) - Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering [49.43814054718318]
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer.<n>Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity.<n>We propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs.
arXiv Detail & Related papers (2025-08-15T06:36:13Z) - Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models [0.0]
This paper introduces Semantic Divergence Metrics (SDM), a novel framework for detecting Faithfulness Hallucinations.<n>A heatmap of topic co-occurances between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue.
arXiv Detail & Related papers (2025-08-13T20:55:26Z) - An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.<n>Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.<n>We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.<n>Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z) - Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation [39.7657197805346]
Point cloud few-shot semantic segmentation (PC-FSS) aims to segment targets of novel categories in a given query point cloud with only a few annotated support samples.
We propose a simple yet effective framework in the spirit of Decoupled Localization and Expansion (DLE)
DLE, including a structural localization module (SLM) and a self-expansion module (SEM), enjoys several merits.
arXiv Detail & Related papers (2024-08-25T07:34:32Z) - Unfolding ADMM for Enhanced Subspace Clustering of Hyperspectral Images [43.152314090830174]
We introduce an innovative clustering architecture for hyperspectral images (HSI) by unfolding an iterative solver based on the Alternating Direction Method of Multipliers (ADMM) for sparse subspace clustering.
Our approach captures well the structural characteristics of HSI data by employing the K nearest neighbors algorithm as part of a structure preservation module.
arXiv Detail & Related papers (2024-04-10T15:51:46Z) - Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent
Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies.
A neural network is employed to act as structure prior and reveal the underlying signal interdependencies.
Deep unrolling and Deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
arXiv Detail & Related papers (2022-03-29T21:00:39Z) - Tight integration of neural- and clustering-based diarization through
deep unfolding of infinite Gaussian mixture model [84.57667267657382]
This paper introduces a it trainable clustering algorithm into the integration framework.
Speaker embeddings are optimized during training such that it better fits iGMM clustering.
Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate.
arXiv Detail & Related papers (2022-02-14T07:45:21Z) - A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment.
Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions.
Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z) - Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain
Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) is to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.