Understanding and Improving In-Context Learning on Vision-language Models
- URL: http://arxiv.org/abs/2311.18021v1
- Date: Wed, 29 Nov 2023 19:08:11 GMT
- Title: Understanding and Improving In-Context Learning on Vision-language Models
- Authors: Shuo Chen, Zhen Han, Bailan He, Mark Buckley, Philip Torr, Volker Tresp, Jindong Gu
- Abstract summary: In-context learning (ICL) on large language models (LLMs) has received great attention, and this technique can also be applied to vision-language models (VLMs).
This study investigates the significance of both visual and language information.
We propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES).
- Score: 42.7212469140844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, in-context learning (ICL) on large language models (LLMs) has
received great attention, and this technique can also be applied to
vision-language models (VLMs) built upon LLMs. These VLMs can respond to
queries by conditioning responses on a series of multimodal demonstrations,
which comprise images, queries, and answers. Though ICL has been extensively
studied on LLMs, its research on VLMs remains limited. The inclusion of
additional visual information in the demonstrations motivates the following
research questions: which of the two modalities in the demonstration is more
significant? How can we select effective multimodal demonstrations to enhance
ICL performance? This study investigates the significance of both visual and
language information. Our findings indicate that ICL in VLMs is predominantly
driven by the textual information in the demonstrations whereas the visual
information in the demonstrations barely affects the ICL performance.
Subsequently, we provide an understanding of the findings by analyzing the
model information flow and comparing model inner states given different ICL
settings. Motivated by our analysis, we propose a simple yet effective
approach, termed Mixed Modality In-Context Example Selection (MMICES), which
considers both visual and language modalities when selecting demonstrations and
shows better ICL performance. Extensive experiments are conducted to support
our findings, understanding, and improvement of the ICL performance of VLMs.
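The key idea of MMICES is to use both modalities when retrieving demonstrations rather than ranking by text alone. The sketch below illustrates one natural instantiation of this mixed-modality selection: candidates are first pre-filtered by visual similarity to the query image and then ranked by textual similarity of their questions, after which the selected (image, question, answer) triples are assembled into an ICL prompt. This is a minimal sketch assuming generic precomputed image/text embeddings; the encoder choice, the ordering of the two stages, the filter size, the similarity measure, and the prompt template are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of mixed-modality in-context example selection (MMICES-style).
# Assumptions: demonstrations are pre-encoded with some image/text encoder
# (e.g. CLIP-style embeddings); hyperparameters and prompt format are illustrative.
import numpy as np

def cosine_sim(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    q = query / (np.linalg.norm(query) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    return c @ q

def select_demonstrations(query_img, query_txt, cand_img, cand_txt, n_filter=32, k=4):
    """Pre-filter candidates by visual similarity, then rank the survivors by
    textual similarity; return the indices of the top-k demonstrations."""
    visual_scores = cosine_sim(query_img, cand_img)
    kept = np.argsort(-visual_scores)[:n_filter]          # visual pre-filtering
    text_scores = cosine_sim(query_txt, cand_txt[kept])   # language-based ranking
    return kept[np.argsort(-text_scores)[:k]]

def build_icl_prompt(demos, query_question):
    """Assemble a multimodal ICL prompt; '<image>' marks where image tokens go."""
    parts = [f"<image> Question: {q} Answer: {a}" for q, a in demos]
    parts.append(f"<image> Question: {query_question} Answer:")
    return "\n".join(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cand_img = rng.normal(size=(100, 512))   # toy candidate image embeddings
    cand_txt = rng.normal(size=(100, 512))   # toy candidate question embeddings
    idx = select_demonstrations(rng.normal(size=512), rng.normal(size=512),
                                cand_img, cand_txt, n_filter=32, k=4)
    demos = [(f"question {i}", f"answer {i}") for i in idx]
    print(build_icl_prompt(demos, "What is shown in the image?"))
```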
Related papers
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z)
- Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z)
- Comparable Demonstrations are Important in In-Context Learning: A Novel Perspective on Demonstration Selection [22.29452683679149]
In-Context Learning (ICL) is an important paradigm for adapting Large Language Models (LLMs) to downstream tasks through a few demonstrations.
This study explores the ICL mechanisms from a novel perspective, providing a deeper insight into the demonstration selection strategy for ICL.
arXiv Detail & Related papers (2023-12-12T18:05:46Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- Exploring the Relationship between In-Context Learning and Instruction Tuning [18.186126518966017]
In-Context Learning (ICL) and Instruction Tuning (IT) are two primary paradigms of adopting Large Language Models to downstream applications.
In ICL, a set of demonstrations is provided at inference time, but the LLM's parameters are not updated.
In IT, a set of demonstrations is used to tune the LLM's parameters at training time, but no demonstrations are used at inference time.
arXiv Detail & Related papers (2023-11-17T07:40:46Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: a Deep-Thinking stage and a test stage.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
- ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)