Improving Visual Grounding by Encouraging Consistent Gradient-based
Explanations
- URL: http://arxiv.org/abs/2206.15462v4
- Date: Sun, 7 Jan 2024 00:24:21 GMT
- Title: Improving Visual Grounding by Encouraging Consistent Gradient-based
Explanations
- Authors: Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez
- Abstract summary: We show that Attention Mask Consistency (AMC) produces better visual grounding results than previous methods.
AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
- Score: 58.442103936918805
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a margin-based loss for tuning joint vision-language models so
that their gradient-based explanations are consistent with region-level
annotations provided by humans for relatively smaller grounding datasets. We
refer to this objective as Attention Mask Consistency (AMC) and demonstrate
that it produces better visual grounding results than previous methods that
rely on using vision-language models to score the outputs of object detectors.
Particularly, a model trained with AMC on top of standard vision-language
modeling objectives obtains a state-of-the-art accuracy of 86.49% in the
Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when
compared to the best previous model trained under the same level of
supervision. Our approach also performs exceedingly well on established
benchmarks for referring expression comprehension where it obtains 80.34%
accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC
is effective, easy to implement, and is general as it can be adopted by any
vision-language model, and can use any type of region annotations.
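The abstract describes the objective only as a margin-based loss that makes gradient-based explanations agree with human region annotations. A minimal sketch of that idea follows. It assumes the explanation heatmap has already been computed (for example by a GradCAM-style method) and is given as a 2D grid, and that the annotation is a binary mask of the same shape. The function name `margin_consistency_loss`, the mean-based comparison, and the default margin of 0.5 are illustrative assumptions, not the authors' exact formulation.

```python
from typing import Sequence


def margin_consistency_loss(
    heatmap: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    margin: float = 0.5,  # hypothetical default, not taken from the paper
) -> float:
    """Hinge penalty that reaches zero once the mean heatmap value inside
    the annotated region exceeds the mean outside it by at least `margin`."""
    inside = []
    outside = []
    for heat_row, mask_row in zip(heatmap, mask):
        for value, is_inside in zip(heat_row, mask_row):
            (inside if is_inside else outside).append(value)

    # With an empty or full mask there is nothing to compare.
    if not inside or not outside:
        return 0.0

    mean_inside = sum(inside) / len(inside)
    mean_outside = sum(outside) / len(outside)
    return max(0.0, margin + mean_outside - mean_inside)


# Heatmap concentrated on the top row of a 2x2 grid.
heatmap = [[0.9, 0.8], [0.1, 0.2]]
aligned = margin_consistency_loss(heatmap, [[1, 1], [0, 0]])
misaligned = margin_consistency_loss(heatmap, [[0, 0], [1, 1]])
```

Here `aligned` is zero because the heatmap already favors the annotated region by more than the margin, while `misaligned` is positive. In an actual training setup this term would be computed with differentiable tensor operations and added to the model's other objectives.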
Related papers
- Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation [31.61985215677114]
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data.
Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities.
This is a preprint technical report with thorough evaluations to understand the entire process.
(2024-06-21T08:29:31Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
(2024-03-20T17:59:43Z)
- Self-supervised co-salient object detection via feature correspondence at multiple scales [27.664016341526988]
This paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations.
We train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images.
In experiments on three CoSOD benchmark datasets, our model outperforms the corresponding state-of-the-art models by a huge margin.
(2024-03-17T06:21:21Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
(2023-12-17T09:44:27Z)
- Robust Fine-Tuning of Vision-Language Models for Domain Generalization [6.7181844004432385]
Foundation models have impressive zero-shot inference capabilities and robustness under distribution shifts.
We present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP.
Our experiments demonstrate that, while zero-shot CLIP fails to match the performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts.
(2023-11-03T20:50:40Z)
- Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
(2023-11-02T16:31:49Z)
- GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
(2023-06-06T16:16:05Z)
- Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concept.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
(2022-10-27T02:57:26Z)
- Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision [106.77639982059014]
We present the ConST-CL framework to effectively learn spatio-temporally fine-grained representations.
We first design a region-based self-supervised task which requires the model to learn to transform instance representations from one view to another guided by context features.
We then introduce a simple design that effectively reconciles the simultaneous learning of both holistic and local representations.
(2021-12-09T19:13:41Z)