CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal
Models with Multiple Image Inputs
- URL: http://arxiv.org/abs/2401.02582v1
- Date: Fri, 5 Jan 2024 00:26:07 GMT
- Title: CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal
Models with Multiple Image Inputs
- Authors: Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai
Chen, Jiebo Luo
- Abstract summary: The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching.
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
- Score: 48.269363759989915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When exploring the development of Artificial General Intelligence (AGI), a
critical task for these models involves interpreting and processing information
from multiple image inputs. However, Large Multimodal Models (LMMs) encounter
two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a
tendency to blend information across multiple images. We first extensively
investigate the capability of LMMs to perceive fine-grained visual details when
dealing with multiple input images. The research focuses on two aspects: first,
image-to-image matching (to evaluate whether LMMs can effectively reason and
pair relevant images), and second, multi-image-to-text matching (to assess
whether LMMs can accurately capture and summarize detailed image information).
We conduct evaluations on a range of both open-source and closed-source large
models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model
performance, we further develop a Contrastive Chain-of-Thought (CoCoT)
prompting approach based on multi-input multimodal models. This method requires
LMMs to compare the similarities and differences among multiple image inputs,
and then guides the models to answer detailed questions about multi-image inputs
based on the identified similarities and differences. Our experimental results
showcase CoCoT's proficiency in enhancing the multi-image comprehension
capabilities of large multimodal models.
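As a rough illustration of how such a contrastive prompt might be assembled, the sketch below builds a two-step instruction (first compare the images, then answer using the identified similarities and differences) and sends it with several images through an OpenAI-compatible chat API. The prompt wording, the `gpt-4o` model name, and the example URLs are assumptions for illustration, not the authors' exact template or evaluation setup.

```python
# Minimal sketch of CoCoT-style (Contrastive Chain-of-Thought) multi-image prompting.
# Assumes an OpenAI-compatible chat API and publicly reachable image URLs;
# the prompt text is illustrative, not the paper's exact template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def cocot_prompt(question: str) -> str:
    """Build a two-step instruction: compare the images first, then answer."""
    return (
        "You are given multiple images.\n"
        "Step 1: List the similarities and differences among the images.\n"
        "Step 2: Based on those similarities and differences, answer the question:\n"
        f"{question}"
    )


def ask_multi_image(question: str, image_urls: list[str], model: str = "gpt-4o") -> str:
    """Send the CoCoT-style prompt together with all images in one user turn."""
    content = [{"type": "text", "text": cocot_prompt(question)}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Example usage (hypothetical URLs):
# print(ask_multi_image(
#     "Which image shows the older building?",
#     ["https://example.com/a.jpg", "https://example.com/b.jpg"],
# ))
```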
Related papers
- MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate the importance of a careful mix of image-caption, interleaved image-text, and text-only data for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)