Link-Context Learning for Multimodal LLMs
- URL: http://arxiv.org/abs/2308.07891v1
- Date: Tue, 15 Aug 2023 17:33:24 GMT
- Title: Link-Context Learning for Multimodal LLMs
- Authors: Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu
- Abstract summary: Link-context learning (LCL) emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs.
LCL guides the model to discern not only the analogy but also the underlying causal associations between data points.
To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset.
- Score: 40.923816691928536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to learn from context with novel concepts and to deliver
appropriate responses is essential in human conversations. Despite current
Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being
trained on mega-scale datasets, recognizing unseen images or understanding
novel concepts in a training-free manner remains a challenge. In-Context
Learning (ICL) explores training-free few-shot learning, where models are
encouraged to "learn to learn" from limited tasks and generalize to unseen
tasks. In this work, we propose link-context learning (LCL), which emphasizes
"reasoning from cause and effect" to augment the learning capabilities of
MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal
relationship between the support set and the query set. By providing
demonstrations with causal links, LCL guides the model to discern not only the
analogy but also the underlying causal associations between data points, which
empowers MLLMs to recognize unseen images and understand novel concepts more
effectively. To facilitate the evaluation of this novel approach, we introduce
the ISEKAI dataset, consisting exclusively of unseen generated image-label
pairs designed for link-context learning. Extensive experiments show that our
LCL-MLLM exhibits stronger link-context learning capabilities on novel concepts
than vanilla MLLMs. Code and data will be released at
https://github.com/isekai-portal/Link-Context-Learning.
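The support-set and query-set setup mentioned in the abstract can be pictured as a few-shot prompt in which image-label demonstrations precede a query image. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Demo` class, the `build_prompt` function, the `<image:...>` placeholder syntax, the file names, and the made-up label "blicket" are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Demo:
    """One support-set demonstration: an image reference paired with a label."""
    image_ref: str  # hypothetical placeholder for an image (path or ID)
    label: str      # name of the novel concept shown in the image


def build_prompt(support: List[Demo], query_image_ref: str) -> str:
    """Place the support demonstrations first, then pose the query.

    The query is meant to be answerable only by linking it back to the
    demonstrations, since the label is assumed to be unseen in training.
    """
    lines = [f"<image:{d.image_ref}> This is a {d.label}." for d in support]
    lines.append(f"<image:{query_image_ref}> What is this?")
    return "\n".join(lines)


# Two demonstrations of a made-up concept, followed by a query image.
support = [Demo("img_001.png", "blicket"), Demo("img_002.png", "blicket")]
prompt = build_prompt(support, "img_003.png")
print(prompt)
```

A real multimodal model would receive image tensors rather than text placeholders; the string form is used here only to make the ordering of support and query visible.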
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder.
To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework.
Our method achieves substantial performance gains on various downstream tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
- Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL).
This study introduces a benchmark VL-ICL Bench for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z)
- RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality.
We propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- In-Context Exemplars as Clues to Retrieving from Large Associative Memory [1.2952137350423816]
In-context learning (ICL) enables large language models (LLMs) to learn patterns from in-context exemplars without training.
How to choose exemplars remains unclear due to the lack of understanding of how in-context learning works.
Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval.
arXiv Detail & Related papers (2023-11-06T20:13:29Z)
- IERL: Interpretable Ensemble Representation Learning -- Combining CrowdSourced Knowledge and Distributed Semantic Representations [11.008412414253662]
Large Language Models (LLMs) encode meanings of words in the form of distributed semantics.
Recent studies have shown that LLMs tend to generate unintended, inconsistent, or wrong texts as outputs.
We propose a novel ensemble learning method, Interpretable Ensemble Representation Learning (IERL), that systematically combines LLM and crowdsourced knowledge representations.
arXiv Detail & Related papers (2023-06-24T05:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.