Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
- URL: http://arxiv.org/abs/2510.27391v1
- Date: Fri, 31 Oct 2025 11:32:15 GMT
- Title: Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
- Authors: Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi
- Abstract summary: Alignment across Trees is a method that constructs and aligns tree-like hierarchical features for both image and text modalities. We introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers.
- Score: 49.95082206008502
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
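The abstract describes embedding each modality's feature tree into a hyperbolic manifold with its own curvature. A minimal numpy sketch of that embedding step on the Poincaré ball, using the standard Möbius-addition distance formula; the curvature values and feature vectors below are arbitrary illustrations, not the paper's implementation (the paper's KL-based alignment across manifolds is not reproduced here):

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball of curvature -c (c > 0)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def expmap0(v, c):
    """Exponential map at the origin: lift a Euclidean feature onto the ball."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def poincare_dist(x, y, c):
    """Geodesic distance on the Poincare ball of curvature -c."""
    diff = mobius_add(-x, y, c)
    arg = np.clip(np.sqrt(c) * np.linalg.norm(diff), 0.0, 1.0 - 1e-12)
    return (2.0 / np.sqrt(c)) * np.arctanh(arg)

# Image and text features live on balls with distinct curvatures,
# mirroring the paper's heterogeneous-manifold setup.
c_img, c_txt = 1.0, 0.5
img_feat = expmap0(np.array([0.3, -0.1, 0.2]), c_img)
txt_feat = expmap0(np.array([0.25, -0.05, 0.15]), c_txt)
```

Each embedded point lies strictly inside its ball of radius 1/sqrt(c), and distances can only be compared between points on the same ball, which is precisely why the paper needs an alignment mechanism across curvatures.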
Related papers
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridging this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z) - Wasserstein-Aligned Hyperbolic Multi-View Clustering [58.29261653100388]
This paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling.
arXiv Detail & Related papers (2025-12-10T07:56:19Z) - Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z) - Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis [21.380034877048644]
We propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies in hyperbolic space. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. Our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.
arXiv Detail & Related papers (2025-06-19T15:30:33Z) - Semantic-Space-Intervened Diffusive Alignment for Visual Classification [11.621655970763467]
Cross-modal alignment is an effective approach to improving visual classification. This paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods.
arXiv Detail & Related papers (2025-05-09T01:41:23Z) - N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z) - Contrastive Multi-view Hyperbolic Hierarchical Clustering [33.050054725595736]
We propose Contrastive Multi-view Hyperbolic Hierarchical Clustering (CMHHC)
It consists of three components, i.e., multi-view alignment learning, aligned feature similarity learning, and continuous hyperbolic hierarchical clustering.
Experimental results on five real-world datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-05-05T12:56:55Z) - Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
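Several of the related papers above (e.g., WAH and HyperPath) embed features into the Lorentz (hyperboloid) model of hyperbolic space rather than the Poincaré ball. A minimal numpy sketch of that lifting step under the standard curvature −1 convention; this is a generic illustration, not any listed paper's actual implementation:

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def expmap_origin(v_eucl):
    """Lift a Euclidean feature onto the hyperboloid
    {x : <x, x>_L = -1, x0 > 0} via the exponential map at o = (1, 0, ..., 0)."""
    o = np.zeros(len(v_eucl) + 1)
    o[0] = 1.0
    v = np.concatenate([[0.0], v_eucl])  # tangent vectors at o have zero time component
    n = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if n < 1e-12:
        return o
    return np.cosh(n) * o + np.sinh(n) * v / n

def lorentz_dist(x, y):
    """Geodesic distance on the hyperboloid (curvature -1)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

# Lift a 2-D feature; its distance from the origin equals the feature's norm.
x = expmap_origin(np.array([0.3, 0.4]))
o = expmap_origin(np.array([0.0, 0.0]))
```

The hyperboloid constraint `<x, x>_L = -1` is what makes distances near the "boundary" grow exponentially, which is why these models suit tree-like hierarchies.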
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.