Related papers: Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

URL: http://arxiv.org/abs/2507.08021v1
Date: Tue, 08 Jul 2025 08:07:57 GMT
Title: Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis
Authors: Li Li, Yongliang Wu, Jingze Zhu, Jiawei Peng, Jianfei Cai, Xu Yang,
Abstract summary: In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of In-Context Learning (ICL)<n>Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities.<n>This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task.
Score: 28.52057785196361
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.

Related papers

Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning [12.054910727620154]
Eye-tracking data reveals valuable insights into users' cognitive states but is difficult to analyze due to its structured, non-linguistic nature.<n>This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals.
arXiv Detail & Related papers (2025-07-24T09:49:53Z)
Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales [7.119479942471737]
Existing methods rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text.<n>We propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA.
arXiv Detail & Related papers (2025-05-20T15:28:26Z)
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks.<n>We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning [39.54891426369773]
Trade-offs between model size, architecture, and performance remain underexplored.<n>In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones.<n>To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures.
arXiv Detail & Related papers (2025-03-19T18:10:12Z)
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research.<n> multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks.<n>This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis [2.1329326061804816]
This paper introduces Large Language Models (LLMs) for event decomposition and proposes a reinforcement learning framework for Multimodal Aspect-based Sentiment Analysis (MABSA-RL) Experimental results show that MABSA-RL outperforms existing advanced methods on two benchmark datasets.
arXiv Detail & Related papers (2024-10-18T03:40:45Z)
From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning [47.82447085244952]
We show that modalities matter differently across tasks in multimodal ICL.<n> Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance.
arXiv Detail & Related papers (2024-07-01T01:57:21Z)
Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
In-context segmentation aims to segment objects using given reference images.<n>Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries.<n>This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model for in-context segmentation.
arXiv Detail & Related papers (2024-03-14T17:52:31Z)
Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs) Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages. The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
Multilingual Multi-Aspect Explainability Analyses on Machine Reading Comprehension Models [76.48370548802464]
This paper focuses on conducting a series of analytical experiments to examine the relations between the multi-head self-attention and the final MRC system performance. We discover that passage-to-question and passage understanding attentions are the most important ones in the question answering process. Through comprehensive visualizations and case studies, we also observe several general findings on the attention maps, which can be helpful to understand how these models solve the questions.
arXiv Detail & Related papers (2021-08-26T04:23:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.