How to Configure Good In-Context Sequence for Visual Question Answering
- URL: http://arxiv.org/abs/2312.01571v1
- Date: Mon, 4 Dec 2023 02:03:23 GMT
- Title: How to Configure Good In-Context Sequence for Visual Question Answering
- Authors: Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang
- Abstract summary: In this study, we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations.
Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations.
We uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
- Score: 19.84012680826303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the success of Large Language Models in dealing with new tasks
via In-Context Learning (ICL) in NLP, researchers have also developed Large
Vision-Language Models (LVLMs) with ICL capabilities. However, when
implementing ICL with these LVLMs, researchers usually resort to the simplest
methods, such as random sampling, to configure the in-context sequence, leading to
sub-optimal results. To enhance ICL performance, in this study we use Visual
Question Answering (VQA) as a case study to explore diverse in-context
configurations and find powerful ones. Additionally, by observing how the LVLM's
outputs change as the in-context sequence is altered, we gain insights into the
inner properties of LVLMs, improving our understanding of them. Specifically, to
explore in-context configurations, we design diverse
retrieval methods and employ different strategies to manipulate the retrieved
demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2,
VizWiz, and OK-VQA, we uncover three important inner properties of the applied
LVLM and demonstrate which strategies can consistently improve the ICL VQA
performance. Our code is available at:
https://github.com/GaryJiajia/OFv2_ICL_VQA.
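As a rough illustration of the idea in the abstract (retrieving demonstrations and then manipulating the retrieved sequence), here is a minimal Python sketch of similarity-based in-context sequence configuration for VQA. The `embed` callable, the example field names, and the `<image>` prompt format are assumptions made for this sketch, not the authors' actual implementation (see the repository above for that).

```python
# Minimal sketch: similarity-based retrieval of in-context demonstrations
# for VQA, plus a simple ordering strategy for the retrieved sequence.
# All names here (embed, field names, the <image> token) are illustrative
# assumptions, not the paper's exact setup.
import numpy as np

def build_icl_sequence(query, support_set, embed, k=4, similar_last=True):
    """Select k demonstrations for `query` and format an ICL prompt.

    query:        dict with "image" and "question" keys
    support_set:  list of dicts with "image", "question", "answer" keys
    embed:        callable mapping an example dict to a unit-norm vector
                  (e.g. CLIP image or question features; an assumption)
    similar_last: if True, place the most similar demo right before the query
    """
    q_vec = embed(query)
    # Cosine similarity between the query and every candidate demonstration.
    sims = np.array([float(embed(d) @ q_vec) for d in support_set])
    top_k = np.argsort(-sims)[:k]          # k most similar, best first
    demos = [support_set[i] for i in top_k]
    if similar_last:
        demos = demos[::-1]                # most similar demo goes last
    # Interleave image placeholders with question-answer text.
    prompt = "".join(
        f"<image>Question: {d['question']} Answer: {d['answer']}\n" for d in demos
    )
    prompt += f"<image>Question: {query['question']} Answer:"
    images = [d["image"] for d in demos] + [query["image"]]
    return images, prompt
```

Swapping the similarity function (image features vs. question features) or the ordering flag corresponds to the kind of retrieval methods and demonstration-manipulation strategies the abstract describes.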
Related papers
- Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z)
- LIVE: Learnable In-Context Vector for Visual Question Answering [37.89141789981324]
Researchers have developed Large Multimodal Models (LMMs) with In-Context Learning (ICL) capabilities.
Applying ICL usually faces two major challenges: 1) using more in-context demonstrations (ICDs) greatly increases inference time, and 2) performance is sensitive to the selection of ICDs.
We propose Learn In-Context VEctor (LIVE) to distill task information from demonstrations, improving ICL performance in LMMs.
arXiv Detail & Related papers (2024-06-19T03:33:45Z)
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [38.29072578390376]
We show that, while effective, ICL alignment with URIAL still underperforms instruction fine-tuning on the established benchmark MT-Bench.
We provide the first, to our knowledge, systematic comparison of ICL and instruction fine-tuning (IFT) for instruction following in the low data regime.
arXiv Detail & Related papers (2024-05-30T09:28:56Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
Ours is a novel approach to extend the utility of LLMs in the context of video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance in generating computer programs from natural language prompts.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
- Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose some knowledge-enhancement techniques towards improving the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z)
- Learning to Retrieve In-Context Examples for Large Language Models [69.9707552694766]
Large language models (LLMs) have demonstrated their ability to learn in-context.
The effectiveness of in-context learning is heavily reliant on the quality of the selected examples.
We propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples.
arXiv Detail & Related papers (2023-07-14T05:23:08Z)
- ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.