StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
- URL: http://arxiv.org/abs/2602.20089v3
- Date: Mon, 02 Mar 2026 08:46:07 GMT
- Title: StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
- Authors: Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, Marco Cristani
- Abstract summary: We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps as proxies for the visual structure of an image. Fine-tuning augments the standard alignment loss with three structure-centric losses. Our method serves as a general boosting recipe that can be integrated into future approaches.
- Score: 12.94672471629668
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
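The loss structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every term is a symmetric InfoNCE (CLIP-style) contrastive loss over paired embeddings, treats local edge regions and text chunks as already paired one-to-one, and uses hypothetical names (`symmetric_info_nce`, `structure_augmented_loss`, `weights`) and equal default weights that do not come from the paper. Embeddings are plain Python lists, so the example needs only the standard library.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style contrastive loss: a[i] and b[i] form the positive pair."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    # Cosine similarity between every a[i] and b[j], scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


def structure_augmented_loss(image, text, edge, struct_text,
                             edge_regions, text_chunks,
                             weights=(1.0, 1.0, 1.0)):
    """Standard alignment loss plus three structure-centric terms.

    Assumption: all four terms share the same contrastive form, and the
    weights are hypothetical hyperparameters, not values from the paper.
    """
    w_global, w_local, w_consistency = weights
    base = symmetric_info_nce(image, text)                   # image <-> caption
    global_term = symmetric_info_nce(edge, struct_text)      # (i) edge map <-> structural text
    local_term = symmetric_info_nce(edge_regions, text_chunks)  # (ii) edge region <-> text chunk
    consistency = symmetric_info_nce(edge, image)            # (iii) edge map <-> color image
    return (base + w_global * global_term
            + w_local * local_term + w_consistency * consistency)
```

In this sketch, perfectly aligned pairs drive every term close to zero, while mismatched pairs raise the corresponding term. In practice the embeddings would come from the vision and text encoders, with the edge maps produced by an edge detector such as Canny, as the abstract mentions.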
Related papers
- StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval [75.28673512571449]
A key challenge in Continual Text-to-Video Retrieval (CTVR) is feature drift. We propose StructAlign, a structured cross-modal alignment method for CTVR. Our method consistently outperforms state-of-the-art continual retrieval approaches.
arXiv Detail & Related papers (2026-01-28T13:34:44Z)
- MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP [4.6096940605642915]
MulCLIP is an end-to-end framework that bridges natural long-text structures with image components. It preserves global contrastive alignment between images and both summary and long captions. It extends positional embeddings for longer text sequences.
arXiv Detail & Related papers (2025-12-08T03:23:41Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning. CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. We introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective scene text recognition (STR) method built upon the image and text encoders of CLIP. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhance Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval [12.050958976545914]
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments.
We propose a novel Structured Multi-modal Feature Embedding and Alignment model for image-sentence retrieval.
In particular, the relations of the visual and textual fragments are modeled by constructing a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels.
arXiv Detail & Related papers (2021-08-05T07:24:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.