Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
- URL: http://arxiv.org/abs/2306.00228v1
- Date: Wed, 31 May 2023 22:48:27 GMT
- Title: Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
- Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
- Abstract summary: We study whether visual cropping can improve the performance of state-of-the-art visual question answering models on fine-detail questions.
We devise two automatic cropping strategies: one based on CLIP multi-modal embeddings and one based on gradients of the BLIP visual QA model.
We gain an absolute improvement of 4.59% on the general VQA-random task simply by inputting a concatenation of the original image and the gradient-based crop.
- Score: 6.063024872936599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering is a challenging task, as it requires seamless
interaction between perceptual, linguistic, and background knowledge systems.
While the recent progress of vision-and-language models such as BLIP has improved
performance on this task, we lack a clear understanding of how well such models
perform on different kinds of questions and reasoning types.
As our initial analysis of BLIP-family models revealed difficulty with
answering fine-detail questions, we investigate the following question: Can
visual cropping be employed to improve the performance of state-of-the-art
visual question answering models on fine-detail questions? Given the recent
success of the BLIP-family models, we study a zero-shot and a fine-tuned BLIP
model. We define three controlled subsets of the popular VQA-v2 benchmark to
measure whether cropping can help model performance. Besides human cropping, we
devise two automatic cropping strategies: one based on CLIP multi-modal
embeddings and one based on gradients of the BLIP visual QA model. Our
experiments demonstrate that the
performance of BLIP model variants can be significantly improved through human
cropping, and automatic cropping methods can produce comparable benefits. A
deeper dive into our findings indicates that the performance enhancement is
more pronounced in zero-shot models than in fine-tuned models and more salient
with smaller bounding boxes than larger ones. We perform case studies to
connect quantitative differences with qualitative observations across question
types and datasets. Finally, we see that the cropping enhancement is robust, as
we gain an improvement of 4.59% (absolute) in the general VQA-random task by
simply inputting a concatenation of the original and gradient-based cropped
images. We make our code available to facilitate further innovation on visual
cropping methods for question answering.
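The two automatic strategies described above can be approximated with off-the-shelf tooling. Below is a minimal sketch, not the authors' released implementation, of a CLIP-based cropper: it slides a window over the image, embeds each candidate crop and the question with CLIP, and keeps the crop whose embedding best matches the question. The checkpoint name, window size, and stride are illustrative assumptions.

```python
# Sketch of CLIP-based visual cropping: score sliding-window crops against the
# question with CLIP and keep the best-matching one. Window/stride fractions and
# the checkpoint are illustrative choices, not the paper's exact settings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()

def clip_best_crop(image: Image.Image, question: str,
                   window_frac: float = 0.5, stride_frac: float = 0.25):
    """Return (crop, box) for the sliding-window crop most similar to the question."""
    W, H = image.size
    win_w, win_h = int(W * window_frac), int(H * window_frac)
    step_w, step_h = max(1, int(W * stride_frac)), max(1, int(H * stride_frac))

    crops, boxes = [], []
    for top in range(0, H - win_h + 1, step_h):
        for left in range(0, W - win_w + 1, step_w):
            box = (left, top, left + win_w, top + win_h)
            crops.append(image.crop(box))
            boxes.append(box)

    inputs = clip_processor(text=[question], images=crops,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image has shape (num_crops, 1): similarity of each crop to the question.
    best = out.logits_per_image.squeeze(-1).argmax().item()
    return crops[best], boxes[best]
```

For the robustness result above (the 4.59% absolute gain from inputting a concatenation of the original and the gradient-based crop), the sketch below assumes the crop box is already available from a saliency pass and that "concatenation" means pasting the two images side by side; both the layout and the hard-coded box are assumptions, not the paper's exact recipe.

```python
# Sketch of the "original + crop" input to a BLIP VQA model. The crop box here is
# a hypothetical placeholder standing in for a gradient-based saliency crop.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
blip = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
blip.eval()

def concat_original_and_crop(image: Image.Image, box) -> Image.Image:
    """Paste the original image and its crop (resized to the original size) side by side."""
    crop = image.crop(box).resize(image.size)
    combined = Image.new("RGB", (image.width * 2, image.height))
    combined.paste(image, (0, 0))
    combined.paste(crop, (image.width, 0))
    return combined

image = Image.open("example.jpg").convert("RGB")   # placeholder path
question = "What is written on the small sign?"
box = (120, 80, 360, 320)                          # hypothetical crop box

inputs = blip_processor(images=concat_original_and_crop(image, box),
                        text=question, return_tensors="pt")
with torch.no_grad():
    answer_ids = blip.generate(**inputs)
print(blip_processor.decode(answer_ids[0], skip_special_tokens=True))
```

In practice, the best crop returned by `clip_best_crop` could be substituted for the hypothetical box, keeping the pipeline fully automatic.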
Related papers
- Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities.
Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a vast amount of real human feedback on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback items on 18,973 images.
We design a GPT-assisted conversion to process this feedback into 200K instruction-response pairs in diverse formats.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
- Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters, we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering [20.35687327831644]
We study the robustness of Visual Question Answering (VQA) models from a novel perspective: visual context.
SwapMix perturbs the visual context by swapping features of irrelevant context objects with features from other objects in the dataset.
We train the models with perfect sight and find that over-reliance on context depends strongly on the quality of the visual representations.
arXiv Detail & Related papers (2022-04-05T15:32:25Z)
- Dependent Multi-Task Learning with Causal Intervention for Image Captioning [10.6405791176668]
In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI).
Firstly, we involve an intermediate task, bag-of-categories generation, before the final task, image captioning.
Secondly, we apply Pearl's do-calculus on the model, cutting off the link between the visual features and possible confounders.
Finally, we use a multi-agent reinforcement learning strategy to enable end-to-end training and reduce the inter-task error accumulations.
arXiv Detail & Related papers (2021-05-18T14:57:33Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.