Related papers: Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

URL: http://arxiv.org/abs/2506.09473v1
Date: Wed, 11 Jun 2025 07:38:12 GMT
Title: Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
Authors: Cheng Chen, Yunpeng Zhai, Yifan Zhao, Jinyang Gao, Bolin Ding, Jia Li
Abstract summary: This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. We propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole.
Score: 45.06983025267863
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails in modeling the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
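
To make the exploration-exploitation idea in the abstract concrete, here is a minimal sketch of an epsilon-greedy selector that picks a set of demonstrations and is rewarded for the set as a whole. It is not the paper's framework: the function names, the toy reward, and the per-demonstration running-mean value estimate are all assumptions made for illustration, and a real system would score a set by querying an LVLM on a VQA example.

```python
import random
from typing import Callable, Dict, List, Sequence


def select_demonstrations(
    values: Dict[int, float], k: int, epsilon: float, rng: random.Random
) -> List[int]:
    """Pick k distinct demonstration ids with an epsilon-greedy rule."""
    remaining = list(values)
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        if rng.random() < epsilon:
            pick = rng.choice(remaining)  # explore: try a random candidate
        else:
            pick = max(remaining, key=lambda d: values[d])  # exploit: best estimate
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def train_selector(
    pool_size: int,
    k: int,
    reward_fn: Callable[[Sequence[int]], float],
    episodes: int = 500,
    epsilon: float = 0.2,
    seed: int = 0,
) -> Dict[int, float]:
    """Estimate a value per demonstration from set-level rewards."""
    rng = random.Random(seed)
    values = {d: 0.0 for d in range(pool_size)}
    counts = {d: 0 for d in range(pool_size)}
    for _ in range(episodes):
        chosen = select_demonstrations(values, k, epsilon, rng)
        reward = reward_fn(chosen)  # one reward for the whole selected set
        for d in chosen:
            counts[d] += 1
            # Running mean of the rewards seen whenever d was in the set.
            values[d] += (reward - values[d]) / counts[d]
    return values


def toy_reward(chosen: Sequence[int]) -> float:
    """Hypothetical stand-in for LVLM accuracy: demos 0 and 3 are 'helpful'."""
    helpful = {0, 3}
    return len(set(chosen) & helpful) / len(chosen)


if __name__ == "__main__":
    learned = train_selector(pool_size=6, k=2, reward_fn=toy_reward)
    # Greedy selection (epsilon = 0) using the learned estimates.
    print(select_demonstrations(learned, k=2, epsilon=0.0, rng=random.Random(1)))
```

Crediting each demonstration with the reward of the whole set is a simplification: it does not model interactions between demonstrations, which the abstract identifies as a weakness of selecting demonstrations individually.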

Related papers

DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer [50.64531021352504]
Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. Existing approaches typically rely on example selection, including in some agentic or multi-step settings. We propose DICE, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning.
arXiv Detail & Related papers (2025-07-31T13:42:14Z)
True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), that is, adapting to new tasks. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
Learning to Select In-Context Demonstration Preferred by Large Language Model [21.077656767563255]
In-context learning (ICL) enables large language models to adapt to new tasks during inference using only a few demonstrations. We propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods in selecting the most effective demonstrations.
arXiv Detail & Related papers (2025-05-26T13:26:56Z)
Demonstration Selection for In-Context Learning via Reinforcement Learning [16.103533806505403]
Relevance-Diversity Enhanced Selection (RDES) is an innovative approach to optimize the selection of diverse reference demonstrations. RDES employs frameworks like Q-learning and a PPO-based variant to dynamically identify demonstrations that maximize diversity. We demonstrate that RDES significantly enhances performance compared to ten established baselines.
arXiv Detail & Related papers (2024-12-05T08:33:52Z)
Improving Generalization in Visual Reasoning via Self-Ensemble [0.0]
We propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters. Our key insight is that LVLM itself can ensemble without the need for any other LVLMs, which helps to unlock their internal capabilities.
arXiv Detail & Related papers (2024-10-28T10:04:40Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
Large Language Models Know What Makes Exemplary Contexts [42.90814615222177]
In-context learning (ICL) has proven to be a significant capability with the advancement of Large Language Models (LLMs). This paper presents a unified framework for LLMs that allows them to self-select influential in-context examples to compose their contexts.
arXiv Detail & Related papers (2024-08-14T12:32:41Z)
Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs). MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task. We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z)
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? [11.374310255084753]
We introduce a novel supervised MLLM-retriever MSIER that employs a neural network to select examples that enhance multimodal in-context learning efficiency. This approach is validated through extensive testing across three distinct tasks, demonstrating the method's effectiveness. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
arXiv Detail & Related papers (2024-04-19T13:05:37Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs). Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages. The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.