LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
- URL: http://arxiv.org/abs/2501.06186v1
- Date: Fri, 10 Jan 2025 18:59:51 GMT
- Title: LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
- Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan,
- Abstract summary: We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models.
We introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks.
Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps.
Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
- Score: 103.0226977561914
- License:
- Abstract: Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against close-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8\% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
Related papers
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.
They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison.
We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [36.042551817732964]
We introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.
With only 100k training samples and a simple yet effective inference time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks.
arXiv Detail & Related papers (2024-11-15T18:58:31Z) - ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [42.03770972100087]
We introduce a novel visual reasoning framework named ProReason.
ProReason features multi-run proactive perception and decoupled vision-reasoning capabilities.
Our experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods.
arXiv Detail & Related papers (2024-10-18T03:22:06Z) - Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM's families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z) - Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models [46.26140720993383]
Multi-LogiEval is a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths.
We conduct evaluations on a range of Large Language Models including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral.
arXiv Detail & Related papers (2024-06-24T23:02:56Z) - Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models [73.4425450752596]
Chain-of-thought (CoT) prompting has impressively unlocked the reasoning potential of large language models (LLMs)
Yet, the standard CoT is less effective in problems demanding multiple reasoning steps.
We propose RESPROMPT, a new prompting strategy that advances multi-step reasoning in LLMs.
arXiv Detail & Related papers (2023-10-07T08:56:28Z) - Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z) - Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples.
We show that LLMs are decent zero-shot reasoners by simply adding Let's think step by step'' before each answer.
Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
arXiv Detail & Related papers (2022-05-24T09:22:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.