Related papers: Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

URL: http://arxiv.org/abs/2406.14562v1
Date: Thu, 20 Jun 2024 17:59:45 GMT
Title: Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Authors: Sachit Menon, Richard Zemel, Carl Vondrick,
Abstract summary: We introduce a simple method to unlock the visual reasoning capabilities of multimodal large language models. Whiteboard-of-thought prompting provides models with a metaphorical whiteboard' to draw out reasoning steps as images. This simple approach shows state-of-the-art results on four difficult natural language tasks.
Score: 30.96613796974929
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.

Related papers

ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. Existing CLIP-based approaches embed images and text independently, and fuse the result. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images [72.42826916932519]
We release JourneyBench, a benchmark of generated images to assess the model's fine-grained multimodal reasoning abilities. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
arXiv Detail & Related papers (2024-09-19T17:58:16Z)
Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking [8.02728252625147]
Large Language Model can reasonably understand and generate human expressions but may lack thorough thinking and reasoning mechanisms. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS. We train the language model by the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses.
arXiv Detail & Related papers (2024-09-18T15:32:48Z)
Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only. Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts. We find that they are not able to generalize well to simple abstract patterns. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition. We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving. In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z)
Chain of Thought Prompt Tuning in Vision Language Models [29.85907584680661]
We propose a novel chain of thought prompt tuning for vision-language modeling. We are the first to successfully adapt chain-of-thought prompting that combines visual and textual embeddings.
arXiv Detail & Related papers (2023-04-16T23:59:25Z)
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images. We develop a visual-language model equipped with multi-level cross-modality attention mechanism.
arXiv Detail & Related papers (2022-03-16T09:17:41Z)
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. In this work, we repurpose such models to generate a descriptive text given an image at inference time. The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition. How to effectively model linguistic rules in end-to-end deep networks remains a research challenge. We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective are text-only representations in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.