SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
- URL: http://arxiv.org/abs/2409.14083v1
- Date: Sat, 21 Sep 2024 09:36:14 GMT
- Title: SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
- Authors: Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng,
- Abstract summary: Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing.
Currently, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references.
We propose a self-refinement framework designed to teach LVLMs to Selectively Retrieved Information (SURf)
- Score: 26.049228685973667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets demonstrate that our framework significantly enhances LVLMs ability to effectively utilize retrieved multimodal references and improves their robustness against irrelevant or misleading information. The source code is available at https://github.com/GasolSun36/SURf.
Related papers
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering [44.008094698200026]
We introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources.
Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge.
This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
arXiv Detail & Related papers (2024-11-25T19:01:03Z) - Making Large Vision Language Models to be Good Few-shot Learners [11.204701216476815]
Few-shot classification (FSC) is a fundamental yet challenging task in computer vision.
LVLMs risk learning specific response formats rather than effectively extracting useful information from support data.
In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases.
arXiv Detail & Related papers (2024-08-21T03:01:11Z) - SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge [56.772051051558215]
Large vision-language models (LVLMs) are ignorant of the up-to-date knowledge, such as LLaVA series, because they cannot be updated frequently.
We propose a plug-and-play framework, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed SearchLVLMs.
arXiv Detail & Related papers (2024-05-23T13:32:07Z) - When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively [3.705145020383824]
We show how Large Language Models (LLMs) can learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question.
arXiv Detail & Related papers (2024-04-30T16:52:55Z) - Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Enhancing In-Context Learning with Answer Feedback for Multi-Span
Question Answering [9.158919909909146]
In this paper, we propose a novel way of employing labeled data such as it informs LLM of some undesired output.
Experiments on three multi-span question answering datasets and a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM's in-context learning performance.
arXiv Detail & Related papers (2023-06-07T15:20:24Z) - RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.