Chain of Images for Intuitively Reasoning
- URL: http://arxiv.org/abs/2311.09241v1
- Date: Thu, 9 Nov 2023 11:14:51 GMT
- Title: Chain of Images for Intuitively Reasoning
- Authors: Fanxu Meng, Haotong Yang, Yiding Wang, Muhan Zhang
- Abstract summary: We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
- Score: 23.692458865558486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The human brain is naturally equipped to comprehend and interpret visual
information rapidly. When confronted with complex problems or concepts, we use
flowcharts, sketches, and diagrams to aid our thought process. Leveraging this
inherent ability can significantly enhance logical reasoning. However, current
Large Language Models (LLMs) do not utilize such visual intuition to help their
thinking. Even the most advanced vision language models (e.g., GPT-4V and
LLaVA) merely align images into textual space, which means their reasoning
processes remain purely verbal. To mitigate such limitations, we present a
Chain of Images (CoI) approach, which can convert complex language reasoning
problems to simple pattern recognition by generating a series of images as
intermediate representations. Furthermore, we have developed a CoI evaluation
dataset encompassing 15 distinct domains where images can intuitively aid
problem-solving. Based on this dataset, we aim to construct a benchmark to
assess the capability of future multimodal large-scale models to leverage
images for reasoning. In supporting our CoI reasoning, we introduce a symbolic
multimodal large language model (SyMLLM) that generates images strictly based
on language instructions and accepts both text and image as input. Experiments
on Geometry, Chess and Common Sense tasks sourced from the CoI evaluation
dataset show that CoI improves performance significantly over the pure-language
Chain of Thought (CoT) baselines. The code is available at
https://github.com/GraphPKU/CoI.
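To make the pipeline concrete, the sketch below shows one way a CoI-style loop could be wired up for a geometry question. It is a minimal illustration, not the authors' SyMLLM implementation (see the repository above for that); the `call_llm` and `call_mllm` callables, the JSON scene schema, and the matplotlib rendering step are all illustrative assumptions. The idea it captures is the one stated in the abstract: a model first emits a strict symbolic description of the figure, a deterministic renderer turns it into an image, and a multimodal model then answers with that image as an intermediate representation.

```python
"""Minimal Chain-of-Images-style sketch (hypothetical interfaces, not SyMLLM).

Assumed callables:
  call_llm(prompt)          -> str   # any text-only LLM
  call_mllm(prompt, image)  -> str   # any model accepting text plus an image path
"""
import json
import matplotlib.pyplot as plt


def render_geometry(scene: dict, path: str = "step.png") -> str:
    """Render a symbolic scene {'points': {name: [x, y]}, 'segments': [[a, b], ...]}
    into an image, i.e. the intermediate representation of the reasoning chain."""
    pts = scene["points"]
    for a, b in scene["segments"]:
        (x1, y1), (x2, y2) = pts[a], pts[b]
        plt.plot([x1, x2], [y1, y2], "k-")
    for name, (x, y) in pts.items():
        plt.annotate(name, (x, y))
    plt.axis("equal")
    plt.axis("off")
    plt.savefig(path, bbox_inches="tight")
    plt.close()
    return path


def chain_of_images(question: str, call_llm, call_mllm) -> str:
    # Step 1: the language model turns the problem into a strict symbolic description.
    scene_json = call_llm(
        "Describe the figure for this problem as JSON with 'points' and 'segments':\n"
        + question
    )
    scene = json.loads(scene_json)

    # Step 2: the symbolic description is rendered deterministically into an image.
    image_path = render_geometry(scene)

    # Step 3: a multimodal model answers by looking at the image, so the final step
    # becomes pattern recognition rather than purely verbal reasoning.
    return call_mllm(question, image_path)
```

In the paper's setup, a single SyMLLM plays both model roles, emitting the symbolic image specification and then accepting the rendered image alongside the text; the split into two callables above is only for clarity.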
Related papers
- Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities [30.96613796974929]
We introduce a simple method to unlock the visual reasoning capabilities of multimodal large language models.
Whiteboard-of-thought prompting provides models with a metaphorical 'whiteboard' to draw out reasoning steps as images.
This simple approach shows state-of-the-art results on four difficult natural language tasks.
arXiv Detail & Related papers (2024-06-20T17:59:45Z)
- Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models [61.203151615743366]
We introduce COG, a two-stage framework with COncept-Guided vision-language models.
The framework comprises a Concept Integration module, which effectively identifies image-text pairs of long-tailed entities, and an Evidence Fusion module, which offers explainability and enables human verification.
Our comprehensive experiments show that COG not only improves the accuracy of recognizing long-tailed image-text pairs compared to baselines but also offers flexibility and explainability.
arXiv Detail & Related papers (2024-06-16T11:49:00Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Chain of Thought Prompt Tuning in Vision Language Models [29.85907584680661]
We propose a novel chain of thought prompt tuning for vision-language modeling.
We are the first to successfully adapt chain-of-thought prompting to combine visual and textual embeddings.
arXiv Detail & Related papers (2023-04-16T23:59:25Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Visually Grounded Reasoning across Languages and Cultures [27.31020761908739]
We develop a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures.
We focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish.
We create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images.
arXiv Detail & Related papers (2021-09-28T16:51:38Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.