Improved Visual Grounding through Self-Consistent Explanations
- URL: http://arxiv.org/abs/2312.04554v1
- Date: Thu, 7 Dec 2023 18:59:22 GMT
- Title: Improved Visual Grounding through Self-Consistent Explanations
- Authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg,
Vicente Ordonez
- Abstract summary: We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, together with SelfEQ, a weakly-supervised objective on visual explanation maps that encourages a phrase and its paraphrase to ground to the same image region.
- Score: 58.51131933246332
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-and-language models trained to match images with text can be combined
with visual explanation methods to point to the locations of specific objects
in an image. Our work shows that the localization --"grounding"-- abilities of
these models can be further improved by finetuning for self-consistent visual
explanations. We propose a strategy for augmenting existing text-image datasets
with paraphrases using a large language model, and SelfEQ, a weakly-supervised
strategy on visual explanation maps for paraphrases that encourages
self-consistency. Specifically, for an input textual phrase, we attempt to
generate a paraphrase and finetune the model so that the phrase and paraphrase
map to the same region in the image. We posit that this both expands the
vocabulary that the model is able to handle, and improves the quality of the
object locations highlighted by gradient-based visual explanation methods (e.g.
GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k,
ReferIt, and RefCOCO+ over a strong baseline method and several prior works.
In particular, compared to other methods that do not use any type of box
annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%),
67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10% and 55.49% on
RefCOCO+ test sets A and B, respectively (an absolute improvement of 3.74% on
average).
Related papers
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation [9.552642210681489]
We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.
We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example.
arXiv Detail & Related papers (2023-10-25T14:10:08Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
- RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
arXiv Detail & Related papers (2023-04-21T03:45:59Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)