Related papers: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

URL: http://arxiv.org/abs/2206.15462v4
Date: Sun, 7 Jan 2024 00:24:21 GMT
Title: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Authors: Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez
Abstract summary: We show that Attention Mask Consistency produces superior visual grounding results than previous methods. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
Score: 58.442103936918805
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34% accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model, and can use any type of region annotations.

Related papers

Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images [2.2124795371148616]
We evaluate vision tramsformers pre-trained with masked image modelling (MIM) against synthetic out-of-distribution (OOD) benchmarks. Experiments demonstrate BEIT's known robustness while maintaining 94% accuracy on PACS and 87% on Office-Home, despite significant occlusions. These insights bridge the gap between lab-trained models and real-world deployment that offer a blueprint for building AI systems that generalise reliably under uncertainty.
arXiv Detail & Related papers (2025-04-05T16:25:34Z)
Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation [19.673388630963807]
We present TailoredBench, a method that conducts customized evaluation tailored to each target model. A Global-coreset is first constructed as a probe to identify the most consistent source models for each target model. A scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model.
arXiv Detail & Related papers (2025-02-19T09:31:50Z)
Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only. Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity. We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z)
Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
Self-supervised co-salient object detection via feature correspondence at multiple scales [27.664016341526988]
This paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations. We train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images. In experiments on three CoSOD benchmark datasets, our model outperforms the corresponding state-of-the-art models by a huge margin.
arXiv Detail & Related papers (2024-03-17T06:21:21Z)
Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs) We first build a vision-language feedback dataset utilizing AI annotation. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning [6.532114018212791]
Fine-tuning vision-language pre-trained models yields competitive or even stronger generalization results. This challenges the standard of using ImageNet-based transfer learning for domain generalization. We also find improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set.
arXiv Detail & Related papers (2023-12-04T16:46:38Z)
Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks. This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z)
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. We introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concept. We show that, the proposed fusion approach is effective to any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision [106.77639982059014]
We present ConST-CL framework to effectively learn-temporally fine-grained representations. We first design a region-based self-supervised task which requires the model to learn to transform instance representations from one view to another guided by context features. We then introduce a simple design that effectively reconciles the simultaneous learning of both holistic and local representations.
arXiv Detail & Related papers (2021-12-09T19:13:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.