Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment
- URL: http://arxiv.org/abs/2508.00332v1
- Date: Fri, 01 Aug 2025 05:42:28 GMT
- Title: Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment
- Authors: Kaiyan Zhao, Zhongtao Miao, Yoshimasa Tsuruoka
- Abstract summary: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. Such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. We propose an approach that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment.
- Score: 14.938401898546553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
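The abstract does not spell out the exact form of MCSEO's object-phrase objective, so the snippet below is only a minimal sketch of a symmetric InfoNCE-style contrastive term over pre-extracted object and phrase embeddings; the function name `object_phrase_info_nce`, the temperature `tau`, and the use of in-batch negatives are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def object_phrase_info_nce(obj_emb: torch.Tensor,
                           phrase_emb: torch.Tensor,
                           tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over aligned object/phrase embedding pairs.

    obj_emb, phrase_emb: (N, d) tensors where row i of each tensor comes
    from the same object-phrase pair (the positive); all other rows in the
    batch act as in-batch negatives.
    """
    obj = F.normalize(obj_emb, dim=-1)
    phr = F.normalize(phrase_emb, dim=-1)
    logits = obj @ phr.t() / tau          # (N, N) scaled cosine similarities
    targets = torch.arange(obj.size(0), device=obj.device)
    # Symmetric cross-entropy: object-to-phrase and phrase-to-object directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random stand-in embeddings.
obj_emb = torch.randn(32, 768)     # e.g., pooled features of detected objects
phrase_emb = torch.randn(32, 768)  # e.g., encoder features of the matched phrases
loss = object_phrase_info_nce(obj_emb, phrase_emb)
```

In a setup like MCSEO's, a term of this kind would be added alongside the usual image-caption contrastive loss rather than replace it.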
Related papers
- A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets [26.167194142428475]
Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. We propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.
arXiv Detail & Related papers (2025-07-07T06:47:10Z)
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In the fashion domain, datasets often exhibit a disparity between the information conveyed in the image and in the text.
We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where information co-occurs in both image and text.
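As a rough illustration of this masking idea (not the authors' implementation), the sketch below selects the top-k most similar patch-token pairs from a cross-modal similarity matrix and flags both sides for masking; the function name `syncmask_targets`, the use of cosine similarity, and the fixed `k` are assumptions.

```python
import torch
import torch.nn.functional as F

def syncmask_targets(patch_feats: torch.Tensor,
                     token_feats: torch.Tensor,
                     k: int = 8) -> tuple[torch.Tensor, torch.Tensor]:
    """Pick co-occurring patch/token positions to mask jointly.

    patch_feats: (P, d) image-patch features; token_feats: (T, d) word-token
    features. Patch-token pairs with the highest cross-modal similarity are
    treated as carrying shared information, and both sides of the top-k
    pairs are flagged for masking.
    """
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(token_feats, dim=-1).t()  # (P, T)
    flat_idx = sim.flatten().topk(k).indices
    patch_idx = torch.div(flat_idx, sim.size(1), rounding_mode="floor")
    token_idx = flat_idx % sim.size(1)
    patch_mask = torch.zeros(patch_feats.size(0), dtype=torch.bool)
    token_mask = torch.zeros(token_feats.size(0), dtype=torch.bool)
    patch_mask[patch_idx] = True
    token_mask[token_idx] = True
    return patch_mask, token_mask

p_mask, t_mask = syncmask_targets(torch.randn(196, 512), torch.randn(24, 512))
```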
arXiv Detail & Related papers (2024-04-01T15:01:38Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module that injects the unmasked image embedding into the masked text embedding is also proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
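A minimal sketch of word-conditional soft masking, assuming region and word features are already extracted; the name `soft_mask_regions`, the max-over-words aggregation, and the normalization to [0, 1] are illustrative choices, not the paper's exact formulation.

```python
import torch

def soft_mask_regions(region_feats: torch.Tensor,
                      word_feats: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Attenuate image region features with word-conditional attention.

    region_feats: (R, d) features of R image regions.
    word_feats:   (T, d) features of T words in the paired caption.
    Returns region features scaled by a soft mask in [0, 1] derived from
    how strongly any word attends to each region.
    """
    attn = torch.softmax(word_feats @ region_feats.t() / temperature, dim=-1)  # (T, R)
    # Aggregate per-region relevance over words, then squash to [0, 1].
    relevance = attn.max(dim=0).values                       # (R,)
    soft_mask = relevance / relevance.max().clamp_min(1e-6)
    return region_feats * soft_mask.unsqueeze(-1)

masked = soft_mask_regions(torch.randn(36, 512), torch.randn(12, 512))
```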
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix (CMC).
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise to uni-modal data, CMC guides models to learn token-level interactions across modalities for better denoising.
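The toy sketch below conveys the token-replacement idea behind cross-modal CutMix: groundable word embeddings are stochastically swapped for image-patch embeddings of the same concept. How the patch bank is built and which tokens count as groundable are assumptions left to an upstream retrieval step, not details from the paper.

```python
import torch

def cross_modal_cutmix(token_emb: torch.Tensor,
                       patch_bank: dict[int, torch.Tensor],
                       replace_prob: float = 0.3) -> torch.Tensor:
    """Toy version of cross-modal CutMix on a single sentence.

    token_emb:  (T, d) embeddings of the sentence's tokens.
    patch_bank: maps a token position to a (d,) image-patch embedding that
                depicts the same concept (assumed retrieved upstream).
    With probability `replace_prob`, a groundable token embedding is swapped
    for its visual counterpart, yielding a multi-modal view of the sentence.
    """
    mixed = token_emb.clone()
    for pos, patch_vec in patch_bank.items():
        if torch.rand(()) < replace_prob:
            mixed[pos] = patch_vec
    return mixed

tokens = torch.randn(10, 256)
bank = {2: torch.randn(256), 7: torch.randn(256)}  # hypothetical grounded tokens
multimodal_view = cross_modal_cutmix(tokens, bank)
```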
arXiv Detail & Related papers (2022-06-17T17:56:47Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
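As an illustrative sketch of text-to-pixel contrastive alignment (not CRIS's exact loss), the snippet below scores every pixel embedding against a sentence embedding and applies per-pixel sigmoid cross-entropy against the ground-truth mask; the temperature, feature shapes, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats: torch.Tensor,
                       text_feat: torch.Tensor,
                       gt_mask: torch.Tensor,
                       tau: float = 0.07) -> torch.Tensor:
    """Contrast a sentence embedding against every pixel embedding.

    pixel_feats: (d, H, W) per-pixel features from the visual decoder;
    text_feat: (d,) sentence-level feature; gt_mask: (H, W) binary mask of
    the referred object. Pixels inside the mask are pulled toward the text
    feature and pixels outside are pushed away.
    """
    d, h, w = pixel_feats.shape
    logits = (text_feat @ pixel_feats.view(d, h * w)) / tau   # (H*W,)
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float().view(-1))

loss = text_to_pixel_loss(torch.randn(256, 26, 26),
                          torch.randn(256),
                          torch.randint(0, 2, (26, 26)))
```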
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
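A minimal sketch of background mixup, under the assumption that a background-only image has already been prepared (e.g., by removing the object located with a CAM-style method such as ContraCAM); the blending coefficient `lam` and the function name are illustrative, not the authors' implementation.

```python
import torch

def background_mixup(img: torch.Tensor,
                     background: torch.Tensor,
                     lam: float = 0.7) -> torch.Tensor:
    """Blend a training image with a background-only image.

    img, background: (3, H, W) tensors in [0, 1]. `background` is assumed
    to come from another image with its object region removed. Blending
    keeps the object visible while perturbing background statistics,
    discouraging the encoder from relying on background shortcuts.
    """
    return lam * img + (1.0 - lam) * background

mixed = background_mixup(torch.rand(3, 224, 224), torch.rand(3, 224, 224))
```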
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)