Topological Alignment of Shared Vision-Language Embedding Space
- URL: http://arxiv.org/abs/2510.10889v1
- Date: Mon, 13 Oct 2025 01:36:38 GMT
- Title: Topological Alignment of Shared Vision-Language Embedding Space
- Authors: Junwon You, Dasol Kang, Jae-Hun Jung
- Abstract summary: ToMCLIP is a topology-aware framework that aligns embedding spaces with topology-preserving constraints. We show enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO.
- Score: 5.5522557994489246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams with theoretical error bounds using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
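To make the idea concrete, here is a minimal sketch (not the authors' implementation) of a topology-style alignment penalty between two embedding clouds. It uses the fact that 0-dimensional persistence deaths of a Vietoris-Rips filtration are exactly the edge lengths of a Euclidean minimum spanning tree; the function names, the L1 matching of sorted deaths, and the omission of the paper's graph-sparsification approximation are all simplifying assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def h0_deaths(X: np.ndarray) -> np.ndarray:
    """H0 persistence 'death' values of a point cloud: for a Vietoris-Rips
    filtration these are exactly the edge weights of a Euclidean MST."""
    D = cdist(X, X)                    # pairwise distance matrix
    mst = minimum_spanning_tree(D)     # sparse matrix holding the n-1 MST edges
    return np.sort(mst.data)

def topological_alignment_loss(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """L1 gap between the sorted H0 death vectors of two embedding clouds --
    a cheap stand-in for a Wasserstein distance between persistence diagrams."""
    da, db = h0_deaths(emb_a), h0_deaths(emb_b)
    m = min(len(da), len(db))
    return float(np.abs(da[:m] - db[:m]).mean())

# Toy usage: English embeddings vs. slightly perturbed "translation" embeddings.
rng = np.random.default_rng(0)
en = rng.normal(size=(128, 512))
de = en + 0.1 * rng.normal(size=(128, 512))
print(topological_alignment_loss(en, de))
```

In actual training one would replace the hard MST computation with a differentiable surrogate so the loss can back-propagate into the encoders.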
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
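The summary gives little detail, but the core idea of adaptively balancing several attack objectives can be sketched generically; the PGD loop, the softmax re-weighting rule (favoring losses that are currently hardest to raise), and all names below are illustrative assumptions, not the MPCAttack algorithm.

```python
import torch

def weighted_pgd(x, loss_fns, steps=10, eps=8 / 255, alpha=2 / 255, tau=1.0):
    """PGD loop that re-balances several per-paradigm attack losses each step:
    losses that are currently smallest (hardest to raise) get more weight."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        losses = torch.stack([fn(x + delta) for fn in loss_fns])   # one per paradigm
        weights = torch.softmax(-losses.detach() / tau, dim=0)     # assumed rule
        (weights * losses).sum().backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient-ascent step
            delta.clamp_(-eps, eps)              # keep perturbation L-inf bounded
        delta.grad.zero_()
    return (x + delta).detach()
```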
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
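As a rough illustration of moving from discrete coordinate tokens to continuous actions, the hypothetical head below regresses a point in [0, 1]^2 from a hidden state; the class name and design are assumptions, not the NV-CoT architecture.

```python
import torch
import torch.nn as nn

class CoordinateHead(nn.Module):
    """Hypothetical head mapping an MLLM hidden state to a continuous (x, y)
    position in [0, 1]^2 instead of emitting discrete coordinate tokens."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, hidden_dim)
        return torch.sigmoid(self.proj(h))                 # normalized image coords

head = CoordinateHead(hidden_dim=4096)
print(head(torch.randn(2, 4096)))  # two predicted (x, y) points
```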
arXiv Detail & Related papers (2026-02-27T12:04:07Z)
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
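A crude reading of the "stable bias" component admits a one-line training-free correction: estimate the gap as the difference of modality means and shift one modality by it. This sketch ignores the anisotropic residuals the paper also models, and the function name is an assumption, not the ReAlign procedure.

```python
import torch
import torch.nn.functional as F

def remove_mean_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Training-free gap reduction: shift text embeddings by the mean
    image-minus-text offset, then renormalize to the unit sphere."""
    bias = img_emb.mean(dim=0) - txt_emb.mean(dim=0)  # stable-bias estimate
    return F.normalize(txt_emb + bias, dim=-1)
```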
arXiv Detail & Related papers (2026-02-02T13:59:39Z)
- Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models [41.79238283279954]
HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures.
arXiv Detail & Related papers (2026-01-15T11:45:56Z)
- Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion [13.39263294343983]
This paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets.
arXiv Detail & Related papers (2025-10-14T02:58:10Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework [4.935224714809964]
We present Multi-Scale Manifold Alignment (MSMA), an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds. We observe consistent hierarchical patterns and find that MSMA improves alignment metrics under multiple estimators. Controlled interventions at different scales yield distinct and architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence.
arXiv Detail & Related papers (2025-05-24T10:25:58Z)
- LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization [4.274520108617021]
LAGO is a novel approach for few-shot cross-lingual embedding inversion attacks. It explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. Experiments show it substantially improves the transferability of attacks, with a 10-20% increase in Rouge-L score over baselines.
arXiv Detail & Related papers (2025-05-21T20:48:24Z)
- Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension [31.952192907460713]
Relation-R1 is the first unified relation comprehension framework. It integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO). Experiments on the widely used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
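GRPO itself is documented elsewhere; its group-relative advantage, which replaces a learned value network, can be sketched as below (the tensor layout is an assumption).

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used by GRPO: each sampled response's
    reward is normalized by its group's mean and std, removing the need
    for a learned value network. rewards: (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled responses each.
print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                    [0.5, 0.5, 0.5, 0.5]])))
```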
arXiv Detail & Related papers (2025-04-20T14:50:49Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection [4.930667479611019]
This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection. It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC.
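For reference, the quoted numbers use standard Intersection-over-Union between predicted and ground-truth masks, computable as below (a generic metric, not code from the paper).

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two boolean segmentation masks --
    the standard metric behind the COCO / Pascal VOC scores quoted above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)
```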
arXiv Detail & Related papers (2024-12-03T16:53:58Z)
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- GL-CLeF: A Global-Local Contrastive Learning Framework for Cross-lingual Spoken Language Understanding [74.39024160277809]
We present the Global-Local Contrastive Learning Framework (GL-CLeF) to address this shortcoming.
Specifically, we employ contrastive learning, leveraging bilingual dictionaries to construct multilingual views of the same utterance.
GL-CLeF achieves the best performance and successfully pulls representations of similar sentences across languages closer.
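A minimal sketch of the dictionary-based view construction the summary describes: swap words through a bilingual dictionary to get a multilingual positive pair for contrastive learning. The function, probability p, and toy dictionary are illustrative assumptions, not the GL-CLeF pipeline.

```python
import random

def code_switch(tokens, bilingual_dict, p=0.5, seed=None):
    """Replace each token that has a dictionary translation with probability p,
    producing a code-switched 'view' of the utterance; (tokens, view) then
    serves as a positive pair for cross-lingual contrastive learning."""
    rng = random.Random(seed)
    return [bilingual_dict[t] if t in bilingual_dict and rng.random() < p else t
            for t in tokens]

# Toy usage with an assumed English-German dictionary.
d = {"play": "spielen", "music": "Musik"}
print(code_switch(["please", "play", "some", "music"], d, p=1.0, seed=0))
# -> ['please', 'spielen', 'some', 'Musik']
```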
arXiv Detail & Related papers (2022-04-18T13:56:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.