Weakly-Supervised Visual-Textual Grounding with Semantic Prior
Refinement
- URL: http://arxiv.org/abs/2305.10913v2
- Date: Tue, 26 Sep 2023 09:29:26 GMT
- Title: Weakly-Supervised Visual-Textual Grounding with Semantic Prior
Refinement
- Authors: Davide Rigoni and Luca Parolari and Luciano Serafini and Alessandro
Sperduti and Lamberto Ballan
- Abstract summary: Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions.
We propose the Semantic Prior Refinement Model (SPRM), whose predictions are obtained by combining the output of two main modules.
Our approach shows state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, with a 9.6% absolute improvement on ReferIt.
- Score: 52.80968034977751
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Using only image-sentence pairs, weakly-supervised visual-textual grounding
aims to learn region-phrase correspondences of the respective entity mentions.
Compared to the supervised approach, learning is more difficult since
correspondences between bounding boxes and textual phrases are unavailable. In
light of this, we
propose the Semantic Prior Refinement Model (SPRM), whose predictions are
obtained by combining the output of two main modules. The first untrained
module aims to return a rough alignment between textual phrases and bounding
boxes. The second trained module is composed of two sub-components that refine
the rough alignment to improve the accuracy of the final phrase-bounding box
alignments. The model is trained to maximize the multimodal similarity between
an image and a sentence, while minimizing the multimodal similarity of the same
sentence and a new unrelated image, carefully selected to help the most during
training. Our approach shows state-of-the-art results on two popular datasets,
Flickr30k Entities and ReferIt, shining especially on ReferIt with a 9.6%
absolute improvement. Moreover, thanks to the untrained component, it reaches
competitive performance using just a small fraction of the training examples.
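The abstract does not spell out the loss, but the training signal it describes (pull a matching image-sentence pair together, push the same sentence away from a carefully chosen unrelated image) can be sketched as a triplet-style contrastive objective. The sketch below is illustrative only; the similarity measure, the margin value, and the helper names are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def multimodal_similarity(image_emb: torch.Tensor, sentence_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between image and sentence embeddings (illustrative choice)."""
    return F.cosine_similarity(image_emb, sentence_emb, dim=-1)

def grounding_loss(pos_image_emb: torch.Tensor,
                   sentence_emb: torch.Tensor,
                   neg_image_emb: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """Triplet-style hinge loss: the matching pair should score at least `margin`
    higher than the pair formed by the same sentence and an unrelated image."""
    pos_sim = multimodal_similarity(pos_image_emb, sentence_emb)
    neg_sim = multimodal_similarity(neg_image_emb, sentence_emb)
    return F.relu(margin - pos_sim + neg_sim).mean()

def hardest_negative(candidate_image_embs: torch.Tensor, sentence_emb: torch.Tensor) -> torch.Tensor:
    """One way to pick an unrelated image 'selected to help the most': take the
    non-matching candidate most similar to the sentence (candidates are assumed
    to already exclude the matching image)."""
    expanded = sentence_emb.unsqueeze(0).expand_as(candidate_image_embs)
    sims = F.cosine_similarity(candidate_image_embs, expanded, dim=-1)
    return candidate_image_embs[sims.argmax()]
```

Maximizing the positive similarity while minimizing the negative one reproduces the push-pull behaviour the abstract attributes to the training objective.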
Related papers
- Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model [27.56988000960972]
We introduce a new framework based on a dual context of both domain-shared and class-specific contexts.
Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in Large Language Models.
We also formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens.
arXiv Detail & Related papers (2024-07-05T13:15:29Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based
Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - A Better Loss for Visual-Textual Grounding [74.81353762517979]
Given a textual phrase and an image, the visual grounding problem is defined as the task of locating the content of the image referenced by the phrase.
It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution.
We propose a model that is able to achieve a higher accuracy than state-of-the-art models thanks to the adoption of a more effective loss function.
arXiv Detail & Related papers (2021-08-11T16:26:54Z) - StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z) - Contrastive Learning for Unpaired Image-to-Image Translation [64.47477071705866]
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain.
We propose a framework based on contrastive learning to maximize mutual information between the two.
We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time.
arXiv Detail & Related papers (2020-07-30T17:59:58Z)
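The last entry's idea of maximizing mutual information between corresponding input and output patches is commonly realized with a patch-wise InfoNCE loss; a minimal, self-contained sketch follows. The feature shapes, the temperature value, and the function name are assumptions for illustration, not details taken from that paper.

```python
import torch
import torch.nn.functional as F

def patchwise_infonce(input_feats: torch.Tensor,
                      output_feats: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """input_feats / output_feats: (num_patches, dim) features extracted at the
    same spatial locations of the input and the translated output image."""
    q = F.normalize(output_feats, dim=-1)   # queries: output patches
    k = F.normalize(input_feats, dim=-1)    # keys: input patches
    logits = q @ k.t() / temperature        # all-pairs similarity (num_patches x num_patches)
    targets = torch.arange(q.size(0), device=q.device)  # positive = same-location patch
    return F.cross_entropy(logits, targets)
```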
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.