LIVE: Learnable In-Context Vector for Visual Question Answering
- URL: http://arxiv.org/abs/2406.13185v3
- Date: Thu, 31 Oct 2024 02:35:03 GMT
- Title: LIVE: Learnable In-Context Vector for Visual Question Answering
- Authors: Yingzhe Peng, Chenduo Hao, Xu Yang, Jiawei Peng, Xinting Hu, Xin Geng,
- Abstract summary: We develop Large Multimodal Models (LMMs) with In-Context Learning (ICL) capabilities.
Applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs.
We propose Learn In-Context VEctor (LIVE) to distill task information from demonstrations, improving ICL performance in LMMs.
- Score: 37.89141789981324
- License:
- Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinational complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies introduce non-learnable In-Context Vectors (ICVs) which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose Learnable In-Context VEctor (LIVE) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that LIVE can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and other non-learnable ICV methods. The code is available at \url{https://github.com/ForJadeForest/LIVE-Learnable-In-Context-Vector}.
Related papers
- Implicit In-context Learning [37.0562059811099]
In-context Learning (ICL) empowers large language models to adapt to unseen tasks during inference by prefixing a few demonstration examples prior to test queries.
We introduce Implicit In-context Learning (I2CL), an innovative paradigm that addresses the challenges associated with traditional ICL by absorbing demonstration examples within the activation space.
I2CL achieves few-shot performance with zero-shot cost and exhibits robustness against the variation of demonstration examples.
arXiv Detail & Related papers (2024-05-23T14:57:52Z) - VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL)
This study introduces a benchmark VL-ICL Bench for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z) - Can MLLMs Perform Text-to-Image In-Context Learning? [11.303734988815016]
The Text-to-Image ICL (T2I-ICL) with its unique characteristics and potential applications remains underexplored.
We benchmark six state-of-the-art Multimodal Large Language Models (MLLMs)
We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties.
arXiv Detail & Related papers (2024-02-02T10:30:05Z) - kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest
Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance in computer programs based on a natural language prompt.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z) - How to Configure Good In-Context Sequence for Visual Question Answering [19.84012680826303]
In this study, we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations.
Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations.
We uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
arXiv Detail & Related papers (2023-12-04T02:03:23Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language
Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs)
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - In-context Vectors: Making In Context Learning More Effective and
Controllable Through Latent Space Steering [37.334374583093165]
Large language models (LLMs) demonstrate emergent in-context learning capabilities.
We propose an alternative approach that recasts in-context learning as in-context vectors (ICV)
ICV achieves better performance compared to standard in-context learning.
arXiv Detail & Related papers (2023-11-11T21:19:44Z) - Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs)
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z) - Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs)
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z) - ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for
Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.