Related papers: Learnable In-Context Vector for Visual Question Answering

URL: http://arxiv.org/abs/2406.13185v1
Date: Wed, 19 Jun 2024 03:33:45 GMT
Title: Learnable In-Context Vector for Visual Question Answering
Authors: Yingzhe Peng, Chenduo Hao, Xu Yang, Jiawei Peng, Xinting Hu, Xin Geng
Abstract summary: We propose Learnable ICV (L-ICV) to distill essential task information from demonstrations, improving ICL performance in Large Multimodal Models (LMMs). Experiments show that L-ICV can significantly reduce computational costs while enhancing accuracy in Visual Question Answering (VQA) tasks compared to traditional ICL and other non-learnable ICV methods.
Score: 37.89141789981324
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinational complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies introduce non-learnable In-Context Vectors (ICVs) which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose Learnable ICV (L-ICV) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that L-ICV can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and other non-learnable ICV methods.

Related papers

M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models [11.542439154523647]
We propose M2IV, a method that substitutes explicit demonstrations with learnable Vectors directly integrated into LVLMs. M2IV achieves robust cross-modal fidelity and fine-grained semantic distillation through training. Experiments show that M2IV surpasses Vanilla ICL and prior representation engineering approaches.
arXiv Detail & Related papers (2025-04-06T22:02:21Z)
Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations [0.0]
Multimodal in-context learning (ICL) has emerged as a key capability of Large Vision-Language Models (LVLMs). We shed light on the core mechanism underlying multimodal ICL, identifying task mapping as a crucial factor in configuring robust in-context demonstration sequences. We propose SabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention.
arXiv Detail & Related papers (2025-03-05T16:33:10Z)
Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding [71.01099784480597]
Large language models (LLMs) excel at a range of tasks through in-context learning (ICL). We introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping.
arXiv Detail & Related papers (2025-02-19T14:04:46Z)
Implicit In-context Learning [37.0562059811099]
In-context Learning (ICL) empowers large language models to adapt to unseen tasks during inference by prefixing a few demonstration examples prior to test queries. We introduce Implicit In-context Learning (I2CL), an innovative paradigm that addresses the challenges associated with traditional ICL by absorbing demonstration examples within the activation space. I2CL achieves few-shot performance with zero-shot cost and exhibits robustness against the variation of demonstration examples.
arXiv Detail & Related papers (2024-05-23T14:57:52Z)
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL). This study introduces a benchmark, VL-ICL Bench, for multimodal in-context learning. We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z)
Can MLLMs Perform Text-to-Image In-Context Learning? [11.303734988815016]
The Text-to-Image ICL (T2I-ICL) with its unique characteristics and potential applications remains underexplored. We benchmark six state-of-the-art Multimodal Large Language Models (MLLMs). We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties.
arXiv Detail & Related papers (2024-02-02T10:30:05Z)
kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language. LLMs have achieved impressive performance in synthesizing computer programs based on a natural language prompt. This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
How to Configure Good In-Context Sequence for Visual Question Answering [19.84012680826303]
In this study, we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. We uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
arXiv Detail & Related papers (2023-12-04T02:03:23Z)
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs). Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering [37.334374583093165]
Large language models (LLMs) demonstrate emergent in-context learning capabilities. We propose an alternative approach that recasts in-context learning as in-context vectors (ICV). ICV achieves better performance compared to standard in-context learning.
arXiv Detail & Related papers (2023-11-11T21:19:44Z)
Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs). We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs). Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages. The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning. We propose a simple but effective in-context learning framework called ICL-D3IE. Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.