Related papers: Can MLLMs Perform Text-to-Image In-Context Learning?

Can MLLMs Perform Text-to-Image In-Context Learning?

URL: http://arxiv.org/abs/2402.01293v3
Date: Sat, 20 Jul 2024 07:52:29 GMT
Title: Can MLLMs Perform Text-to-Image In-Context Learning?
Authors: Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee,
Abstract summary: The Text-to-Image ICL (T2I-ICL) with its unique characteristics and potential applications remains underexplored. We benchmark six state-of-the-art Multimodal Large Language Models (MLLMs) We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties.
Score: 11.303734988815016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.

Related papers

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
Multimodal Contrastive In-Context Learning [0.9120312014267044]
This paper introduces a novel multimodal contrastive in-context learning framework to enhance our understanding of gradient-free in-context learning (ICL) in Large Language Models (LLMs) First, we present a contrastive learning-based interpretation of ICL in real-world settings, marking the distance of the key-value representation as the differentiator in ICL. Second, we develop an analytical framework to address biases in multimodal input formatting for real-world datasets. Third, we propose an on-the-fly approach for ICL that demonstrates effectiveness in detecting hateful memes.
arXiv Detail & Related papers (2024-08-23T10:10:01Z)
LIVE: Learnable In-Context Vector for Visual Question Answering [37.89141789981324]
We develop Large Multimodal Models (LMMs) with In-Context Learning (ICL) capabilities. Applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. We propose Learn In-Context VEctor (LIVE) to distill task information from demonstrations, improving ICL performance in LMMs.
arXiv Detail & Related papers (2024-06-19T03:33:45Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC) This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA)
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
What Makes Multimodal In-Context Learning Work? [58.48612721156335]
We present a framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. We identify several biases and limitations of M-ICL that warrant consideration prior to deployment.
arXiv Detail & Related papers (2024-04-24T08:50:45Z)
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL) This study introduces a benchmark VL-ICL Bench for multimodal in-context learning. We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. We introduce vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z)
MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images [24.17147521556083]
In-context learning has become the most popular way to solve QA problems. We propose MMHQA-ICL framework for addressing this problems. We are the first to use end-to-end prompting method for this task.
arXiv Detail & Related papers (2023-09-09T13:35:01Z)
A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot. This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.