From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
- URL: http://arxiv.org/abs/2511.22150v2
- Date: Mon, 01 Dec 2025 06:39:02 GMT
- Title: From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
- Authors: Florian Rottach, William Rudman, Bastian Rieck, Harrisen Scells, Carsten Eickhoff
- Abstract summary: We present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.
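The UTS framework combines many topological and geometric descriptors into one characterization of an embedding space. As an illustration of that general idea only (the authors' actual measure set is not reproduced here), the toy sketch below concatenates three simple descriptors of a point cloud, including a TwoNN-style intrinsic-dimension estimate, into a single signature vector. All function names and the choice of measures are illustrative assumptions.

```python
import math
import random

def pairwise_dists(points):
    """All pairwise Euclidean distances between points (list of tuples)."""
    d = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d.append(math.dist(points[i], points[j]))
    return d

def twonn_intrinsic_dim(points):
    """TwoNN intrinsic-dimension estimate: d ~ N / sum(log(r2/r1)),
    where r1, r2 are the distances to each point's two nearest neighbors."""
    logs = []
    for i, p in enumerate(points):
        ds = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = ds[0], ds[1]
        if r1 > 0:
            logs.append(math.log(r2 / r1))
    return len(logs) / sum(logs)

def signature(points):
    """Toy 'unified signature': several scalar descriptors in one vector."""
    d = pairwise_dists(points)
    mean_d = sum(d) / len(d)
    var_d = sum((x - mean_d) ** 2 for x in d) / len(d)
    return [mean_d, var_d, twonn_intrinsic_dim(points)]

random.seed(0)
# A 2-D point cloud embedded in 5-D: the estimated intrinsic
# dimension should land near 2 even though the ambient dimension is 5.
cloud = [(random.random(), random.random(), 0.0, 0.0, 0.0) for _ in range(200)]
sig = signature(cloud)
print(sig)
```

Signature vectors like this can then be compared across models, which is the spirit (if not the substance) of the paper's multi-attribute perspective.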
Related papers
- Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation [8.584363058858935]
Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information.
arXiv Detail & Related papers (2025-12-30T05:34:28Z)
- GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs [59.61242815508687]
Graph neural networks (GNNs) on text-attributed graphs (TAGs) encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. This work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure.
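The abstract does not spell out the local PCA-based metric, so the sketch below illustrates a much cruder stand-in: it compares the top principal direction of an embedding set before and after one round of mean neighborhood aggregation, using plain power iteration. The drift definition, function names, and use of a single global (rather than local) principal direction are all illustrative assumptions, not the paper's metric.

```python
import math
import random

def mean_vec(vecs):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[k] for v in vecs) / n for k in range(len(vecs[0]))]

def top_principal_direction(vecs, iters=100):
    """Unit top eigenvector of the covariance of vecs, via power iteration."""
    mu = mean_vec(vecs)
    centered = [[v[k] - mu[k] for k in range(len(mu))] for v in vecs]
    random.seed(1)
    w = [random.random() for _ in mu]
    for _ in range(iters):
        # w <- (X^T X) w, up to scale, then renormalize
        proj = [sum(c[k] * w[k] for k in range(len(w))) for c in centered]
        w = [sum(proj[i] * centered[i][k] for i in range(len(centered)))
             for k in range(len(w))]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w

def aggregate(embs, adj):
    """One round of mean neighborhood aggregation (including self)."""
    return [mean_vec([embs[i]] + [embs[j] for j in adj[i]])
            for i in range(len(embs))]

def drift(embs, new_embs):
    """1 - |cos angle| between top principal directions before/after."""
    a = top_principal_direction(embs)
    b = top_principal_direction(new_embs)
    return 1.0 - abs(sum(x * y for x, y in zip(a, b)))

random.seed(0)
# Noisy embeddings along a line, connected as a chain graph: mean
# aggregation smooths noise but should preserve the dominant direction,
# so this crude drift measure stays close to 0.
embs = [[0.1 * i + random.gauss(0, 0.1),
         0.01 * i + random.gauss(0, 0.1),
         random.gauss(0, 0.1)] for i in range(30)]
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 30] for i in range(30)}
d = drift(embs, aggregate(embs, adj))
print(d)
```

A locality-aware variant would repeat this comparison per node over its k-nearest-neighbor patch, which is presumably closer to what the paper measures.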
arXiv Detail & Related papers (2025-11-12T06:48:43Z)
- Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents [11.168089496463125]
Large language models (LLMs) produce high-dimensional embeddings that capture rich semantic and syntactic relationships between words, sentences, and concepts. We introduce a framework for semi-automatic annotation of these embedding properties.
arXiv Detail & Related papers (2025-07-24T17:43:40Z)
- Analytical Discovery of Manifold with Machine Learning [2.6585498155499643]
We introduce a novel framework, GAMLA (Global Analytical Manifold Learning using Auto-encoding). GAMLA employs a two-round training process within an auto-encoding framework to derive both character and complementary representations for the underlying manifold. We find that the two representations together decompose the whole latent space and can thus characterize the local spatial structure surrounding the manifold.
arXiv Detail & Related papers (2025-04-03T11:53:00Z)
- Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts [3.9426000822656224]
We conjecture that in large language models, the embeddings live in a local manifold structure with different dimensions depending on the perplexities and domains of the input data. By incorporating an attention-based soft-gating network, we verify that our model learns specialized sub-manifolds for an ensemble of input data sources.
arXiv Detail & Related papers (2025-02-19T09:33:16Z)
- Persistent Topological Features in Large Language Models [0.6597195879147556]
We introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space. As a showcase application, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods.
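Zigzag persistence as used in that paper is more involved, but the core notion of persistent 0-dimensional features (connected components) can be sketched simply: every component is born at scale 0 and dies when single-linkage clustering merges it away, so the death times are exactly the edge weights of a minimum spanning tree. The sketch below, using Prim's algorithm, is an illustrative toy, not the paper's method.

```python
import math

def h0_persistence(points):
    """0-dimensional persistence of a point cloud. Each merge event in
    single-linkage clustering kills one connected component, and the
    death times are exactly the minimum-spanning-tree edge weights.
    Returns the sorted death times (all births are 0)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest connection of each node to the tree
    best[0] = 0.0
    deaths = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if best[u] > 0:
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)

# Two well-separated clusters: one large death (the cluster merge)
# dominates the small intra-cluster deaths, signaling two components
# that persist across a wide range of scales.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
bars = h0_persistence(pts)
print(bars)
```

Higher-dimensional holes and the zigzag construction require a full persistent-homology library (e.g. GUDHI or Ripser), but the long-versus-short bar intuition is the same.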
arXiv Detail & Related papers (2024-10-14T19:46:23Z)
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z)
- Structure-Aware Feature Generation for Zero-Shot Learning [108.76968151682621]
We introduce a novel structure-aware feature generation scheme, termed SA-GAN, to account for the topological structure in learning both the latent space and the generative networks.
Our method significantly enhances the generalization capability on unseen classes and consequently improves classification performance.
arXiv Detail & Related papers (2021-08-16T11:52:08Z)
- HUMAP: Hierarchical Uniform Manifold Approximation and Projection [40.77787659104315]
This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to be flexible in preserving local and global structures. We provide empirical evidence of our technique's superiority compared with current hierarchical approaches and show a case study applying HUMAP for dataset labelling.
arXiv Detail & Related papers (2021-06-14T19:27:54Z)
- Transforming Feature Space to Interpret Machine Learning Models [91.62936410696409]
This contribution proposes a novel approach that interprets machine-learning models through the lens of feature space transformations.
It can be used to enhance unconditional as well as conditional post-hoc diagnostic tools.
A case study on remote-sensing landcover classification with 46 features is used to demonstrate the potential of the proposed approach.
arXiv Detail & Related papers (2021-04-09T10:48:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.