Multi-Modal Representation Learning with Text-Driven Soft Masks
- URL: http://arxiv.org/abs/2304.00719v1
- Date: Mon, 3 Apr 2023 05:07:49 GMT
- Title: Multi-Modal Representation Learning with Text-Driven Soft Masks
- Authors: Jaeyoo Park, Bohyung Han
- Abstract summary: We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
- Score: 48.19806080407593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a visual-linguistic representation learning approach within a
self-supervised learning framework by introducing a new operation, loss, and
data augmentation strategy. First, we generate diverse features for the
image-text matching (ITM) task via soft-masking the regions in an image, which
are most relevant to a certain word in the corresponding caption, instead of
completely removing them. Since our framework relies only on image-caption
pairs with no fine-grained annotations, we identify the regions relevant to
each word by computing the word-conditional visual attention using a
multi-modal encoder. Second, we encourage the model to focus more on hard but
diverse examples by proposing a focal loss for the image-text contrastive
learning (ITC) objective, which alleviates the inherent overfitting and bias
issues. Last, we perform multi-modal data augmentations for
self-supervised learning via mining various examples by masking texts and
rendering distortions on images. We show that the combination of these three
innovations is effective for learning a pretrained model, leading to
outstanding performance on multiple vision-language downstream tasks.
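To make the soft-masking operation concrete, here is a minimal PyTorch sketch. It assumes the word-conditional attention over image regions has already been extracted from the multi-modal encoder's cross-attention maps; the function name, tensor shapes, and masking schedule are illustrative assumptions, not the paper's exact implementation.
```python
import torch

def soft_mask_regions(region_feats, attn_weights, temperature=1.0):
    """Attenuate (rather than remove) image regions according to how strongly
    a sampled caption word attends to them.

    region_feats: (B, R, D) visual features for R image regions.
    attn_weights: (B, R) word-conditional attention over regions, e.g. one row
                  of the multi-modal encoder's cross-attention map.
    temperature:  softness of the mask; a higher value keeps more of each region.
    """
    # Normalize attention to [0, 1] per image so the most relevant region is
    # suppressed the most, but never fully zeroed out (soft masking).
    attn = attn_weights / (attn_weights.max(dim=1, keepdim=True).values + 1e-6)
    keep = 1.0 - attn / (1.0 + temperature)       # (B, R), values in (0, 1]
    return region_feats * keep.unsqueeze(-1)      # broadcast over feature dim


# Toy usage: 2 images, 9 regions, 256-dim features
feats = torch.randn(2, 9, 256)
attn = torch.rand(2, 9)
print(soft_mask_regions(feats, attn).shape)  # torch.Size([2, 9, 256])
```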
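The focal reweighting of the image-text contrastive (ITC) objective can be sketched in the same spirit. The snippet below applies the standard focal modulation (1 - p)^gamma to a symmetric InfoNCE-style loss over matched image-text pairs; the precise weighting and temperature used in the paper may differ.
```python
import torch
import torch.nn.functional as F

def focal_itc_loss(image_emb, text_emb, temperature=0.07, gamma=2.0):
    """Image-text contrastive loss with a focal term that down-weights easy
    (already well-matched) pairs and emphasizes hard, diverse ones.

    image_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs.
    """
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    def one_direction(lg):
        log_p = F.log_softmax(lg, dim=1)
        lp_true = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of the match
        # Focal modulation: pairs with p close to 1 contribute little to the loss.
        return (-(1.0 - lp_true.exp()) ** gamma * lp_true).mean()

    # Symmetric loss over image-to-text and text-to-image directions.
    return 0.5 * (one_direction(logits) + one_direction(logits.t()))


# Toy usage
img = F.normalize(torch.randn(4, 128), dim=1)
txt = F.normalize(torch.randn(4, 128), dim=1)
print(focal_itc_loss(img, txt).item())
```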
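Finally, the multi-modal data augmentation (masking caption tokens and rendering distortions on images) could look roughly like the hypothetical pipeline below; the specific transforms and masking ratio are illustrative choices, not the authors' settings.
```python
import random
from torchvision import transforms

# Hypothetical image-distortion pipeline used to mine additional views of an image.
image_distort = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def mask_caption(tokens, mask_token="[MASK]", ratio=0.15):
    """Randomly replace a fraction of caption tokens with a mask token."""
    return [mask_token if random.random() < ratio else t for t in tokens]

print(mask_caption("a dog runs across the green field".split()))
```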
Related papers
- Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining [25.11384964373604]
We propose two pretraining approaches to contextualise visual entities in a multimodal setup.
With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions.
With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts.
arXiv Detail & Related papers (2023-05-23T17:27:12Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to align the image and text representations before fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.