SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
- URL: http://arxiv.org/abs/2511.03019v1
- Date: Tue, 04 Nov 2025 21:33:57 GMT
- Title: SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
- Authors: Wenbo Lu
- Abstract summary: We introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks.
- Score: 1.0914300987810126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up training data. Yet, existing methods treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modal supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.
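The abstract does not spell out the exact form of the structural contrastive objective. As a rough illustration only, the sketch below combines a standard CLIP-style image-text InfoNCE loss with an extra term that treats graph-neighboring entities (e.g., co-purchased products) as additional soft positives; all names (`structural_contrastive_loss`, `neighbor_mask`, `lambda_struct`) are hypothetical and not taken from the paper.

```python
# Minimal sketch of a "structural contrastive" objective in the spirit of SLIP.
# Assumption: the loss is a CLIP-style InfoNCE term plus a graph-neighbor
# alignment term; this is an illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def structural_contrastive_loss(img_emb, txt_emb, neighbor_mask,
                                temperature=0.07, lambda_struct=0.5):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of paired items.
    neighbor_mask: (B, B) bool, True where items i and j are linked in the graph."""
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Standard symmetric CLIP loss: matched image-text pairs on the diagonal.
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    # Structural term: also pull an image toward the texts of its graph
    # neighbors, using soft targets spread over the neighbor columns.
    pos = neighbor_mask.float() + torch.eye(img_emb.size(0), device=img_emb.device)
    pos = pos / pos.sum(dim=1, keepdim=True)
    loss_struct = F.kl_div(F.log_softmax(logits, dim=1), pos, reduction="batchmean")

    return loss_clip + lambda_struct * loss_struct
```

Setting `lambda_struct` to 0 recovers the ordinary CLIP objective, which makes the structural term easy to ablate.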
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a content-reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding EOS embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z) - SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion [11.686307370683922]
Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities.
We propose SLiNT, a modular framework that injects knowledge-graph-derived structural context into a frozen backbone with lightweight LoRA-based adaptation for robust link prediction.
Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines.
arXiv Detail & Related papers (2025-09-08T10:36:49Z) - Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models [0.609170287691728]
We introduce a novel training paradigm to enhance the comprehension of diagrammatic images within vision-language models.
Our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content.
arXiv Detail & Related papers (2025-09-02T05:02:23Z) - ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z) - A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)