SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- URL: http://arxiv.org/abs/2404.01156v1
- Date: Mon, 1 Apr 2024 15:01:38 GMT
- Title: SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Authors: Chull Hwan Song, Taebaek Hwang, Jooyoung Yoon, Shunghyun Choi, Yeong Hyeon Gu,
- Abstract summary: Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In fashion domain, datasets often exhibit a disparity between the information conveyed in image and text.
We propose Synchronized attentional Masking (SyncMask), which generate masks that pinpoint the image patches and word tokens where the information co-occur in both image and text.
- Score: 2.9010546489056415
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem, we propose Synchronized attentional Masking (SyncMask), which generate masks that pinpoint the image patches and word tokens where the information co-occur in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
Related papers
- Debiasing Vison-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
To address the limitations, we propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z) - Masked Vision and Language Modeling for Multi-modal Representation
Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performances by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z) - Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings.
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.