Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings
- URL: http://arxiv.org/abs/2601.09731v1
- Date: Mon, 29 Dec 2025 14:00:12 GMT
- Title: Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings
- Authors: Wen G Gong
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.
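To make the pipeline concrete, here is a minimal, self-contained sketch of PHATE's core idea (diffusion over a kernel graph, log-"potential" distances, then classical MDS) applied to toy embedding vectors. This is a hypothetical simplification for illustration only, not Semanscope's implementation; real analyses would use the full PHATE algorithm, and the bandwidth heuristic and diffusion time below are assumptions.

```python
import numpy as np

def mini_phate(X, n_components=2, t=8, eps=None):
    """Toy PHATE-style embedding: Gaussian kernel -> diffusion operator ->
    t-step diffusion -> log potential distances -> classical MDS."""
    # Pairwise squared Euclidean distances
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if eps is None:
        eps = np.median(D2[D2 > 0])            # simple bandwidth heuristic (assumption)
    K = np.exp(-D2 / eps)
    P = K / K.sum(axis=1, keepdims=True)       # row-stochastic diffusion operator
    Pt = np.linalg.matrix_power(P, t)          # diffuse for t steps
    U = -np.log(Pt + 1e-12)                    # "potential" representation
    # Potential distances between points
    PD = np.sqrt(np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1))
    # Classical MDS on the potential distances
    n = PD.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (PD ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Two toy "semantic clusters" of fake embedding vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 16)),
               rng.normal(1, 0.1, (20, 16))])
Y = mini_phate(X)
print(Y.shape)  # (40, 2)
```

Cluster-like inputs should stay grouped in the 2-D output, while graded or sequential structure (as with the numbers in the abstract) traces out trajectories instead.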
Related papers
- From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures [38.75080027435365]
We present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces.
arXiv Detail & Related papers (2025-11-27T06:37:45Z)
- GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs [59.61242815508687]
Graph neural networks (GNNs) on text-attributed graphs (TAGs) encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. This work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure.
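One plausible reading of such a local PCA drift metric can be sketched as follows: fit a PCA subspace on each point's original neighborhood, then score how far the aggregated embedding moves off that subspace. The neighborhood size, subspace dimension, and the mean-with-a-random-neighbor aggregation below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def local_pca_drift(X_before, X_after, k=10, d=2):
    """For each point: local PCA on its k nearest neighbors (before aggregation),
    then the norm of the displacement component outside that subspace."""
    n = X_before.shape[0]
    D = np.linalg.norm(X_before[:, None] - X_before[None, :], axis=-1)
    drift = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]           # k nearest neighbors, excluding self
        local = X_before[nbrs] - X_before[nbrs].mean(axis=0)
        _, _, Vt = np.linalg.svd(local, full_matrices=False)
        basis = Vt[:d]                              # top-d local principal directions
        delta = X_after[i] - X_before[i]
        resid = delta - basis.T @ (basis @ delta)   # off-manifold component
        drift[i] = np.linalg.norm(resid)
    return drift

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
# Crude stand-in for neighborhood aggregation: average with a random other point
X_agg = 0.5 * (X + X[rng.permutation(60)])
scores = local_pca_drift(X, X_agg)
print(scores.shape)  # (60,)
```

A score near zero means aggregation moved the point along its local manifold; large scores flag drift off it.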
arXiv Detail & Related papers (2025-11-12T06:48:43Z)
- Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models [2.3204178451683264]
We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic transformations as consistent rotational operations in embedding space. RISE operations transfer across both languages and models with little loss of performance. This work provides the first systematic demonstration that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces.
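A standard way to estimate a consistent orthogonal (rotation-like) map between paired embedding sets is orthogonal Procrustes; the sketch below uses it as a stand-in for RISE, whose exact rotor-based formulation is not reproduced here.

```python
import numpy as np

def estimate_rotation(A, B):
    """Orthogonal Procrustes: the orthogonal R minimizing ||A @ R - B||_F
    is U @ Vt, where A.T @ B = U @ S @ Vt."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8))          # "source" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
B = A @ Q                             # "target" embeddings: a hidden orthogonal map
R = estimate_rotation(A, B)
print(np.allclose(A @ R, B, atol=1e-8))  # True: the map is recovered
```

With paired source/target embeddings (e.g. the same sentences in two languages), the recovered R can then be applied to held-out vectors to test how consistently the transformation generalizes.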
arXiv Detail & Related papers (2025-10-10T18:51:32Z)
- Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings [0.0]
We investigate geometric patterns in Chinese character embeddings using PHATE manifold analysis. We observe clustering patterns for content words and branching patterns for function words.
arXiv Detail & Related papers (2025-09-23T14:28:34Z)
- Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations [34.88156871518115]
Next-token prediction (NTP) optimization leads language models to extract and organize semantic structure from text. We demonstrate that concepts corresponding to larger singular values are learned earlier during training, yielding a natural semantic hierarchy. This insight motivates orthant-based clustering, a method that combines concept signs to identify interpretable semantic categories.
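The orthant-based clustering idea, grouping points by the sign pattern of their coordinates along top singular directions, can be sketched as follows; the number of directions k and the plain SVD projection are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def orthant_clusters(E, k=3):
    """Project embeddings onto the top-k singular directions and group
    points by the sign pattern (orthant) of their coordinates."""
    Ec = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    coords = Ec @ Vt[:k].T                 # top-k "concept" coordinates
    signs = (coords > 0).astype(int)       # sign pattern per point
    return signs @ (2 ** np.arange(k))     # encode each orthant as an integer id

rng = np.random.default_rng(2)
E = rng.normal(size=(100, 32))
ids = orthant_clusters(E)
print(ids.min() >= 0 and ids.max() < 8)  # True: at most 2**k = 8 cluster ids
```

With k sign dimensions the method yields at most 2**k interpretable categories, each defined by which side of the top concept directions its members fall on.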
arXiv Detail & Related papers (2025-05-13T08:46:04Z)
- MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams [65.02628814094639]
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether Multimodal Large Language Models genuinely understand mathematical diagrams beyond superficial pattern recognition. We introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. We construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs annotated with geometric primitives and precise spatial relationships.
arXiv Detail & Related papers (2025-03-26T17:30:41Z)
- Geometric Signatures of Compositionality Across a Language Model's Lifetime [47.25475802128033]
We study whether contemporary language models reflect the intrinsic simplicity of language enabled by compositionality. We find that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
arXiv Detail & Related papers (2024-10-02T11:54:06Z)
- A Joint Matrix Factorization Analysis of Multilingual Representations [28.751144371901958]
We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models.
We study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models.
arXiv Detail & Related papers (2023-10-24T04:43:45Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning [62.601681746034956]
Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision.
We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each.
arXiv Detail & Related papers (2022-09-18T18:15:38Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.