Assessing GPT4-V on Structured Reasoning Tasks
- URL: http://arxiv.org/abs/2312.11524v1
- Date: Wed, 13 Dec 2023 08:54:49 GMT
- Authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Gust Verbruggen
- Abstract summary: We show that visual Chain-of-Thought, an extension of Chain-of-Thought to multi-modal LLMs, yields significant improvements over the vanilla model.
We also present a categorized analysis of scenarios where these models perform well and where they struggle, highlighting challenges associated with coherent multimodal reasoning.
- Score: 17.903409875791056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modality promises to unlock further uses for large language models.
Recently, the state-of-the-art language model GPT-4 was enhanced with vision
capabilities. We carry out a prompting evaluation of GPT-4V and five other
baselines on structured reasoning tasks, such as mathematical reasoning, visual
data analysis, and code generation. We show that visual Chain-of-Thought, an
extension of Chain-of-Thought to multi-modal LLMs, yields significant
improvements over the vanilla model. We also present a categorized analysis of
scenarios where these models perform well and where they struggle, highlighting
challenges associated with coherent multimodal reasoning.
Related papers
- Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge [0.5439020425819]
We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations).
Our results reveal that, despite the amazing performance achieved by large language models on various reasoning tasks, the models still suffer from severe drawbacks that leave a large gap relative to humans.
arXiv Detail & Related papers (2024-12-11T11:53:26Z)
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [36.042551817732964]
We introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.
With only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks.
arXiv Detail & Related papers (2024-11-15T18:58:31Z)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs approach GPT-4V on single-image tasks, a significant gap remains on multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models [56.256069117502385]
Chain of Thought (CoT) approaches can be used to enhance the capability of Large Language Models (LLMs) on complex reasoning tasks.
However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored.
We introduce a novel approach that addresses this challenge by using retrieval mechanisms to automatically select demonstration examples.
arXiv Detail & Related papers (2023-12-04T08:07:21Z)
- GLoRE: Evaluating Logical Reasoning of Large Language Models [29.914546407784552]
We introduce GLoRE, a benchmark comprised of 12 datasets that span three different types of tasks.
ChatGPT and GPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing ChatGPT by a large margin.
We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuned method to boost the performance of an open LLM.
arXiv Detail & Related papers (2023-10-13T13:52:15Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions given images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.