Exploring Diverse In-Context Configurations for Image Captioning
- URL: http://arxiv.org/abs/2305.14800v6
- Date: Tue, 23 Jan 2024 04:01:43 GMT
- Title: Exploring Diverse In-Context Configurations for Image Captioning
- Authors: Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, Xin Geng
- Abstract summary: In this paper, we explore the effects of varying configurations on Vision-Language (VL) in-context learning.
We devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning.
Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning.
- Score: 39.54017777410428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: After discovering that Language Models (LMs) can be good in-context few-shot
learners, numerous strategies have been proposed to optimize in-context
sequence configurations. Recently, researchers in Vision-Language (VL) domains
have also developed their own few-shot learners, but they only use the simplest
strategy, i.e., random sampling, to configure in-context image-text pairs. In order to
explore the effects of varying configurations on VL in-context learning, we
devised four strategies for image selection and four for caption assignment to
configure in-context image-text pairs for image captioning. Here, Image
Captioning is used as the case study since it can be seen as a
visually-conditioned LM. Our comprehensive experiments yield two
counter-intuitive but valuable insights, highlighting the distinct
characteristics of VL in-context learning due to multi-modal synergy, as
compared to the NLP case. Furthermore, in our exploration of optimal
combination strategies, we observed an average performance enhancement of 20.9
CIDEr points compared to the baseline. The code is available at
https://github.com/yongliang-wu/ExploreCfg.
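The abstract does not spell out the individual selection and assignment strategies, so the sketch below is only a minimal illustration of one plausible configuration: demonstration images chosen by CLIP image similarity to the query and paired with their ground-truth captions. The checkpoint name, file paths, and helper functions are placeholders, not the paper's actual pipeline (see the repository above for that).

```python
# Minimal sketch (not the paper's exact pipeline): build an in-context sequence
# for image captioning by (1) selecting demonstration images similar to the
# query image and (2) assigning each selected image its ground-truth caption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def build_in_context_sequence(query_path, pool_paths, pool_captions, k=4):
    """Pick the k pool images most similar to the query and pair each with its caption."""
    query_emb = embed_images([query_path])   # (1, d)
    pool_emb = embed_images(pool_paths)      # (N, d)
    top_idx = (query_emb @ pool_emb.T).squeeze(0).topk(k).indices.tolist()
    demos = [(pool_paths[i], pool_captions[i]) for i in top_idx]
    # The demonstrations plus the query image (caption to be generated) form the
    # in-context sequence fed to the vision-language few-shot learner.
    return demos + [(query_path, None)]
```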
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis [44.008094698200026]
This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks.
We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods.
Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning them for specific domains while keeping their generalization capabilities intact remains challenging.
arXiv Detail & Related papers (2024-12-04T19:01:06Z) - Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
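As a rough illustration of the joint-embedding mechanism described above (not the optimization procedure studied in this paper), the snippet below scores a text query against candidate images in CLIP's shared embedding space via the Hugging Face transformers API; the checkpoint name and image paths are placeholders.

```python
# Sketch of CLIP's dual-encoder joint embedding used for text-to-image retrieval.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]  # placeholder paths
inputs = processor(text=["a photo of a dog on the beach"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Normalize both modalities so the dot product is a cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)   # similarity of the query to each image
ranking = scores.argsort(descending=True)   # most similar image first
```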
arXiv Detail & Related papers (2024-09-03T14:33:01Z) - Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z) - ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions [17.934227561793474]
Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text.
We propose ContextBLIP, a method that relies on a doubly contextual alignment scheme for challenging IRCD.
We observe that ContextBLIP can yield results comparable to GPT-4V, despite involving about 7,500 times fewer parameters.
arXiv Detail & Related papers (2024-05-29T16:06:21Z) - How to Configure Good In-Context Sequence for Visual Question Answering [19.84012680826303]
In this study, we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations.
Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations.
We uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
arXiv Detail & Related papers (2023-12-04T02:03:23Z) - Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one while integrating the changes expressed by the caption.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
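The Combiner architecture is not detailed in this summary, so the following is a minimal sketch of one plausible design, assuming a simple concatenate-and-project network over precomputed CLIP features; the layer sizes and fusion scheme are illustrative rather than the paper's actual architecture.

```python
# Toy combiner: fuse a reference-image CLIP feature with a relative-caption CLIP
# feature into a single query embedding for composed image retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two modalities and project back to the CLIP feature size.
        # The normalized output is compared to target-image features by cosine
        # similarity, typically under a contrastive training objective.
        query = self.fuse(torch.cat([image_feat, text_feat], dim=-1))
        return F.normalize(query, dim=-1)
```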
arXiv Detail & Related papers (2023-08-22T15:03:16Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.