MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding
- URL: http://arxiv.org/abs/2010.05379v1
- Date: Mon, 12 Oct 2020 00:43:52 GMT
- Title: MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding
- Authors: Qinxin Wang, Hao Tan, Sheng Shen, Michael W. Mahoney, Zhewei Yao
- Abstract summary: We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
- Score: 74.33171794972688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phrase localization is a task that studies the mapping from textual phrases
to regions of an image. Given difficulties in annotating phrase-to-object
datasets at scale, we develop a Multimodal Alignment Framework (MAF) to
leverage more widely-available caption-image datasets, which can then be used
as a form of weak supervision. We first present algorithms to model
phrase-object relevance by leveraging fine-grained visual representations and
visually-aware language representations. By adopting a contrastive objective,
our method uses information in caption-image pairs to boost the performance in
weakly-supervised scenarios. Experiments conducted on the widely-adopted
Flickr30k dataset show a significant improvement over existing
weakly-supervised methods. With the help of the visually-aware language
representations, we can also improve the previous best unsupervised result by
5.56%. We conduct ablation studies to show that both our novel model and our
weakly-supervised strategies significantly contribute to our strong results.
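To make the approach concrete, here is a minimal PyTorch-style sketch of a caption-image contrastive objective of the kind described above, assuming phrase and region features have already been extracted; the encoders, feature dimensions, max-pooling choice, and temperature are illustrative assumptions, not the authors' exact implementation.
```python
# Hypothetical sketch of a caption-image contrastive objective for
# weakly-supervised phrase grounding (not the authors' released code).
import torch
import torch.nn.functional as F

def caption_image_scores(phrase_feats, region_feats):
    """phrase_feats: (B, P, D) phrase embeddings per caption.
       region_feats: (B, R, D) object/region embeddings per image.
       Returns a (B, B) caption-image similarity matrix."""
    phrase_feats = F.normalize(phrase_feats, dim=-1)
    region_feats = F.normalize(region_feats, dim=-1)
    # Phrase-object relevance: cosine similarity of every phrase in
    # caption i against every region in image j -> (B, P, B, R).
    sim = torch.einsum('ipd,jrd->ipjr', phrase_feats, region_feats)
    # Each phrase attends to its best-matching region (weak supervision:
    # no phrase-level boxes), then phrase scores are averaged per caption.
    best_region = sim.max(dim=-1).values        # (B, P, B)
    return best_region.mean(dim=1)              # (B, B)

def contrastive_loss(phrase_feats, region_feats, temperature=0.1):
    """Matched caption-image pairs (the diagonal) are positives;
       all other pairings in the batch serve as negatives."""
    scores = caption_image_scores(phrase_feats, region_feats) / temperature
    targets = torch.arange(scores.size(0))
    return F.cross_entropy(scores, targets)

# Toy usage with random features: a batch of 4 caption-image pairs.
loss = contrastive_loss(torch.randn(4, 5, 256), torch.randn(4, 36, 256))
```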
Related papers
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation
for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method outperforms state-of-the-art approaches on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
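A minimal sketch of the pixel-level masking augmentation mentioned in the entry above, assuming image tensors in (C, H, W) layout; the patch count and patch size are illustrative choices rather than the paper's exact recipe.
```python
# Hypothetical pixel-level masking augmentation: zero out random
# rectangular patches of an image (illustrative, not the paper's code).
import torch

def pixel_mask(image, num_patches=8, patch_size=32):
    """image: (C, H, W) float tensor. Returns a masked copy."""
    masked = image.clone()
    _, h, w = masked.shape
    for _ in range(num_patches):
        top = torch.randint(0, max(h - patch_size, 1), (1,)).item()
        left = torch.randint(0, max(w - patch_size, 1), (1,)).item()
        masked[:, top:top + patch_size, left:left + patch_size] = 0.0
    return masked

augmented = pixel_mask(torch.rand(3, 224, 224))
```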
- Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining [25.11384964373604]
We propose two pretraining approaches to contextualise visual entities in a multimodal setup.
With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions.
With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts.
arXiv Detail & Related papers (2023-05-23T17:27:12Z)
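A minimal sketch of how visual relation triplets might be verbalised into structured captions, as described in the entry above; the triplet format and the sentence template are assumptions for illustration.
```python
# Hypothetical verbalisation of scene-graph triplets into structured
# captions used as additional image descriptions (illustrative template).
def verbalise_scene_graph(triplets):
    """triplets: list of (subject, predicate, object) strings."""
    clauses = [f"{subj} {pred} {obj}" for subj, pred, obj in triplets]
    return ". ".join(clauses) + "."

caption = verbalise_scene_graph([
    ("a dog", "is chasing", "a ball"),
    ("a ball", "is on", "the grass"),
])
print(caption)  # "a dog is chasing a ball. a ball is on the grass."
```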
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with visual and language scene graphs (SGs), whose fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
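A minimal sketch of word-conditional visual attention used to soft-mask image regions, in the spirit of the entry above; the single-head dot-product attention and the "down-weight the attended regions" choice are assumptions, not the paper's multi-modal encoder.
```python
# Hypothetical word-conditional soft-masking of region features
# (illustrative single-head attention, not the paper's encoder).
import torch
import torch.nn.functional as F

def soft_mask_regions(word_feats, region_feats):
    """word_feats: (T, D) token embeddings; region_feats: (R, D) region
       embeddings. Returns soft-masked region features, one set per word."""
    attn = F.softmax(word_feats @ region_feats.t() / word_feats.size(-1) ** 0.5,
                     dim=-1)                      # (T, R) word-to-region attention
    # Down-weight the regions most relevant to each word to create
    # diverse, harder views for image-text matching.
    mask = 1.0 - attn                             # (T, R) soft mask per word
    return mask.unsqueeze(-1) * region_feats.unsqueeze(0)   # (T, R, D)

masked = soft_mask_regions(torch.randn(6, 256), torch.randn(36, 256))
```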
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object in an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)