Many-Shot In-Context Learning in Multimodal Foundation Models
- URL: http://arxiv.org/abs/2405.09798v2
- Date: Fri, 04 Oct 2024 21:51:47 GMT
- Title: Many-Shot In-Context Learning in Multimodal Foundation Models
- Authors: Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
- Abstract summary: Large language models are effective at few-shot in-context learning (ICL).
Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows.
We benchmark GPT-4o and Gemini 1.5 Pro across 14 datasets spanning multiple domains.
- Abstract: Large language models are effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 14 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (image classification, visual QA, and object localization). We observe that many-shot ICL, including up to almost 2,000 demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. We also find open-weights multimodal foundation models like Llama 3.2-Vision do not benefit from the demonstrating examples, highlighting an important gap between open and closed multimodal foundation models. Given the high inference costs required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro learns more quickly than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL .
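The two mechanics the abstract relies on, interleaving many image/label demonstrations in one prompt and batching several query images into a single API call, can be sketched as below. This is a minimal illustration rather than the authors' implementation (see the linked ManyICL repository for that): it assumes a generic image-classification dataset, uses the OpenAI chat-completions API with base64-encoded images, and the helper names (`encode_image`, `build_many_shot_messages`) and the answer format are placeholders.
```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Return a data URL for a local image so it can be inlined in the prompt."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


def build_many_shot_messages(demos, queries, class_names):
    """Interleave (image, label) demonstrations, then append a batch of query images.

    demos:   list of (image_path, label) pairs used as in-context demonstrations
    queries: list of image paths to classify in a single API call
    """
    content = [{
        "type": "text",
        "text": "Classify each image into one of: " + ", ".join(class_names) + ".",
    }]
    # Many-shot demonstrations: each image followed by its ground-truth label.
    for path, label in demos:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    # Batched queries: several test images answered in one response, which amortizes
    # the long demonstration prefix over many predictions (lower per-query cost/latency).
    for i, path in enumerate(queries, 1):
        content.append({"type": "text", "text": f"Query {i}:"})
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    content.append({
        "type": "text",
        "text": "Answer with one line per query in the form 'Query i: <label>'.",
    })
    return [{"role": "user", "content": content}]


# Illustrative usage: hundreds of demonstrations and a batch of up to 50 queries in one call.
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_many_shot_messages(demos, queries, class_names),
# )
# print(response.choices[0].message.content)
```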
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- In-Context Learning with Long-Context Models: An In-Depth Exploration [96.1389740719691]
We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations.
We show that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples.
arXiv Detail & Related papers (2024-04-30T21:06:52Z)
- Many-Shot In-Context Learning [58.395589302800566]
Large language models (LLMs) excel at few-shot in-context learning (ICL).
We observe significant performance gains across a wide variety of generative and discriminative tasks.
Unlike few-shot learning, many-shot learning is effective at overriding pretraining biases.
arXiv Detail & Related papers (2024-04-17T02:49:26Z)
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
- Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as few as 50K multimodal examples, M$2$IXT can boost few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset exploits the complementarity of existing datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z)
- SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
- Multi-query Video Retrieval [44.32936301162444]
We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries.
We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications.
arXiv Detail & Related papers (2022-01-10T20:44:46Z)
- Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks.
Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
arXiv Detail & Related papers (2021-06-02T22:58:47Z)