Related papers: Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

URL: http://arxiv.org/abs/2410.03321v1
Date: Fri, 4 Oct 2024 11:18:41 GMT
Title: Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Authors: Minheng Ni, Yutao Fan, Lei Zhang, Wangmeng Zuo,
Abstract summary: This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
Score: 53.45295657891099
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak reasoning abilities of disambiguation can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.

Related papers

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge [45.20691825097646]
We introduce VisualPuzzles, a benchmark that targets visual reasoning. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning.
arXiv Detail & Related papers (2025-04-14T15:50:39Z)
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [26.757458496178437]
We introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. We construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL.
arXiv Detail & Related papers (2025-03-13T17:56:05Z)
Explainable artificial intelligence (XAI): from inherent explainability to large language models [0.0]
Explainable AI (XAI) techniques facilitate the explainability or interpretability of machine learning models. This paper details the advancements of explainable AI methods, from inherently interpretable models to modern approaches. We review explainable AI techniques that leverage vision-language model (VLM) frameworks to automate or improve the explainability of other machine learning models.
arXiv Detail & Related papers (2025-01-17T06:16:57Z)
MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities in language part. MageBench is a reasoning capability oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z)
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts. We find that they are not able to generalize well to simple abstract patterns. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models [25.058162782167503]
Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. We introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT.
arXiv Detail & Related papers (2023-12-14T09:13:09Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
In-Context Analogical Reasoning with Pre-Trained Language Models [10.344428417489237]
We explore the use of intuitive language-based abstractions to support analogy in AI systems. Specifically, we apply large pre-trained language models (PLMs) to visual Raven's Progressive Matrices ( RPM) We find that PLMs exhibit a striking capacity for zero-shot relational reasoning, exceeding human performance and nearing supervised vision-based methods.
arXiv Detail & Related papers (2023-05-28T04:22:26Z)
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models. Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data. We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
Social Commonsense Reasoning with Multi-Head Knowledge Attention [24.70946979449572]
Social Commonsense Reasoning requires understanding of text, knowledge about social events and their pragmatic implications, as well as commonsense reasoning skills. We propose a novel multi-head knowledge attention model that encodes semi-structured commonsense inference rules and learns to incorporate them in a transformer-based reasoning cell.
arXiv Detail & Related papers (2020-10-12T10:24:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.