VQA4CIR: Boosting Composed Image Retrieval with Visual Question
Answering
- URL: http://arxiv.org/abs/2312.12273v1
- Date: Tue, 19 Dec 2023 15:56:08 GMT
- Title: VQA4CIR: Boosting Composed Image Retrieval with Visual Question
Answering
- Authors: Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo,
Xinxing Xu, Rick Siow Mong Goh, Yong Liu
- Abstract summary: This work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR.
The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods.
Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
- Score: 68.47402250389685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although progress has been made in Composed Image Retrieval (CIR), we
empirically find that a certain percentage of failed retrieval results are not
consistent with their relative captions. To address this issue, this work
provides a Visual Question Answering (VQA) perspective to boost the performance
of CIR. The resulting VQA4CIR is a post-processing approach that can be directly
plugged into existing CIR methods. Given the top-C images retrieved by a CIR
method, VQA4CIR aims to reduce the adverse effect of retrieval results that are
inconsistent with the relative caption. To identify such inconsistent images, we
resort to a "QA generation to VQA" self-verification pipeline. For QA generation,
we suggest fine-tuning an LLM (e.g., LLaMA) to generate several question-answer
pairs from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to
obtain the VQA model. By feeding a retrieved image and a generated question to
the VQA model, an image is flagged as inconsistent with the relative caption
when the VQA answer disagrees with the answer in the QA pair. Consequently, CIR
performance can be boosted by modifying the ranks of the inconsistently
retrieved images. Experimental results show that our proposed method outperforms
state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
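For illustration only, the sketch below shows how the described self-verification
re-ranking could look in code. The helpers generate_qa_pairs (standing in for the
fine-tuned LLM that turns a relative caption into QA pairs) and vqa_answer
(standing in for the fine-tuned LVLM) are hypothetical, and the exact-match
consistency check, the min_agreement threshold, and the demotion rule are
assumptions; the paper's precise prompts and re-ranking strategy are not
reproduced here.

```python
from typing import Callable, List, Tuple


def rerank_with_vqa(
    top_c_images: List[str],                                    # top-C retrieved image paths/IDs
    relative_caption: str,
    generate_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # hypothetical fine-tuned LLM
    vqa_answer: Callable[[str, str], str],                      # hypothetical fine-tuned LVLM: (image, question) -> answer
    min_agreement: float = 0.5,                                 # assumed consistency threshold
) -> List[str]:
    """Demote retrieved images whose VQA answers disagree with the answers
    generated from the relative caption (a sketch of the 'QA generation to
    VQA' self-verification idea)."""
    qa_pairs = generate_qa_pairs(relative_caption)
    consistent, inconsistent = [], []
    for image in top_c_images:
        # Count how many generated QA pairs the VQA model agrees with.
        agreements = sum(
            vqa_answer(image, question).strip().lower() == answer.strip().lower()
            for question, answer in qa_pairs
        )
        ratio = agreements / max(len(qa_pairs), 1)
        (consistent if ratio >= min_agreement else inconsistent).append(image)
    # Keep the original relative order within each group, but push
    # inconsistent images below the consistent ones.
    return consistent + inconsistent
```

In practice the answer comparison would likely be softer than exact string
matching (e.g., judging answer equivalence with a language model), but the
overall structure of the pipeline is the same.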
Related papers
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on question-relevant image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an ...
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the accompanying discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
- Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment [157.1292674649519]
We propose a practical solution named degraded-reference IQA (DR-IQA).
DR-IQA exploits the inputs of IR models, i.e., degraded images, as references.
Our results can even approach the performance of full-reference settings.
arXiv Detail & Related papers (2021-08-18T02:35:08Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting, where the target answer is generated from a single image.
In this report, we work with four approaches in a bid to improve performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
- Image Quality Assessment for Perceptual Image Restoration: A New Dataset, Benchmark and Metric [19.855042248822738]
Image quality assessment (IQA) is a key factor in the rapid development of image restoration (IR) algorithms.
Recent IR algorithms based on generative adversarial networks (GANs) have brought significant improvements in visual performance.
We raise two questions, including: can existing IQA methods objectively evaluate recent IR algorithms?
arXiv Detail & Related papers (2020-11-30T17:06:46Z)
- Logically Consistent Loss for Visual Question Answering [66.83963844316561]
Current neural-network-based Visual Question Answering (VQA) models cannot ensure logical consistency among related questions due to the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in a multi-task learning framework.
Experiments confirm that the proposed loss formulae and the introduction of hybrid batches lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z)