Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
- URL: http://arxiv.org/abs/2206.15462v4
- Date: Sun, 7 Jan 2024 00:24:21 GMT
- Title: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
- Authors: Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez
- Abstract summary: We show that Attention Mask Consistency (AMC) produces better visual grounding results than previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
- Score: 58.442103936918805
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a margin-based loss for tuning joint vision-language models so
that their gradient-based explanations are consistent with region-level
annotations provided by humans for relatively small grounding datasets. We
refer to this objective as Attention Mask Consistency (AMC) and demonstrate
that it produces better visual grounding results than previous methods that
rely on using vision-language models to score the outputs of object detectors.
In particular, a model trained with AMC on top of standard vision-language
modeling objectives obtains a state-of-the-art accuracy of 86.49% on the
Flickr30k visual grounding benchmark, an absolute improvement of 5.38% over
the best previous model trained under the same level of supervision. Our
approach also performs exceedingly well on established benchmarks for
referring expression comprehension, where it obtains 80.34% accuracy on the
easy test split of RefCOCO+ and 64.55% on the difficult split. AMC is
effective, easy to implement, and general, as it can be adopted by any
vision-language model and can use any type of region annotation.
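The abstract describes the AMC objective only at a high level. The snippet below is a minimal PyTorch sketch of what a margin-based consistency loss between a gradient-based heatmap (e.g., Grad-CAM) and a human-annotated region mask could look like; the function name, the margin values, and the particular combination of max- and mean-based terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def amc_style_loss(heatmap, mask, margin_max=0.3, margin_mean=0.3, eps=1e-6):
    """Margin-based consistency loss between a gradient-based heatmap and a
    human-provided region mask (an illustrative sketch of the AMC idea).

    heatmap: (B, H, W) non-negative saliency map, e.g. Grad-CAM, scaled to [0, 1]
    mask:    (B, H, W) float mask, 1 inside the annotated region, 0 outside
    """
    inside = (heatmap * mask).flatten(1)
    outside = (heatmap * (1.0 - mask)).flatten(1)

    # Max term: the strongest activation inside the region should exceed the
    # strongest activation outside it by at least margin_max.
    loss_max = torch.relu(outside.max(dim=1).values + margin_max
                          - inside.max(dim=1).values)

    # Mean term: the average activation inside the region should exceed the
    # average activation outside it by at least margin_mean.
    mean_in = inside.sum(dim=1) / (mask.flatten(1).sum(dim=1) + eps)
    mean_out = outside.sum(dim=1) / ((1.0 - mask).flatten(1).sum(dim=1) + eps)
    loss_mean = torch.relu(mean_out + margin_mean - mean_in)

    return (loss_max + loss_mean).mean()
```

In training, a term like this would be added to the standard vision-language modeling losses, with the heatmap obtained by backpropagating the image-text matching score to an intermediate feature map, as in Grad-CAM.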
Related papers
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity.
We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions.
The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z) - Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Self-supervised co-salient object detection via feature correspondence at multiple scales [27.664016341526988]
This paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations.
We train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images.
In experiments on three CoSOD benchmark datasets, our model outperforms the corresponding state-of-the-art models by a large margin.
arXiv Detail & Related papers (2024-03-17T06:21:21Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision-language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning [6.532114018212791]
Fine-tuning vision-language pre-trained models yields competitive or even stronger generalization results.
This challenges the standard practice of using ImageNet-based transfer learning for domain generalization.
We also find improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set.
arXiv Detail & Related papers (2023-12-04T16:46:38Z) - Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a vision-language (ViL) model.
Experiments on open-world object recognition show that RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z) - Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, a lightweight, transformer-based fusion module that pairs frozen visual representations with language concepts (a rough sketch of this kind of fusion appears after this list).
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
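As a rough illustration of the fusion idea in the Fusioner entry above, the sketch below pairs frozen visual patch features with frozen language concept embeddings through a single cross-attention layer; the class name, feature dimensions, and projection layers are assumptions made for the example rather than details from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """A lightweight fusion module in the spirit of Fusioner: visual patch
    features from a frozen backbone attend to language concept embeddings
    from a frozen text encoder. Dimensions here are illustrative."""

    def __init__(self, vis_dim=768, txt_dim=512, hidden_dim=256, num_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)  # project frozen visual features
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)  # project frozen language embeddings
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, N_patches, vis_dim); txt_feats: (B, N_concepts, txt_dim)
        q = self.vis_proj(vis_feats)
        kv = self.txt_proj(txt_feats)
        fused, _ = self.attn(q, kv, kv)   # patches attend to language concepts
        return self.norm(q + fused)       # residual connection + layer norm
```

In such a setup, only the small projection and attention layers would be trained, while the visual and language encoders stay frozen.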
This list is automatically generated from the titles and abstracts of the papers listed on this site.