Zero-shot Commonsense Reasoning over Machine Imagination
- URL: http://arxiv.org/abs/2410.09329v1
- Date: Sat, 12 Oct 2024 02:15:11 GMT
- Title: Zero-shot Commonsense Reasoning over Machine Imagination
- Authors: Hyuntae Park, Yeachan Kim, Jun-Hyung Park, SangKeun Lee
- Abstract summary: We propose Imagine, a novel zero-shot commonsense reasoning framework designed to complement textual inputs with visual signals derived from machine-generated images.
We show that Imagine outperforms existing methods by a large margin, highlighting the strength of machine imagination in mitigating reporting bias and enhancing generalization capabilities.
- Score: 14.350718566829343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Recent approaches to zero-shot commonsense reasoning have enabled Pre-trained Language Models (PLMs) to learn a broad range of commonsense knowledge without being tailored to specific situations. However, they often suffer from human reporting bias inherent in textual commonsense knowledge, leading to discrepancies in understanding between PLMs and humans. In this work, we aim to bridge this gap by introducing an additional information channel to PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework designed to complement textual inputs with visual signals derived from machine-generated images. To achieve this, we enhance PLMs with imagination capabilities by incorporating an image generator into the reasoning process. To guide PLMs in effectively leveraging machine imagination, we create a synthetic pre-training dataset that simulates visual question-answering. Our extensive experiments on diverse reasoning benchmarks and analysis show that Imagine outperforms existing methods by a large margin, highlighting the strength of machine imagination in mitigating reporting bias and enhancing generalization capabilities. 
 
      
        Related papers
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models [10.1080193179562]
 Current understanding models excel at recognizing "what" but fall short in high-level cognitive tasks like causal reasoning and future prediction.
We propose a novel framework that fuses a powerful Vision Foundation Model for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core.
 arXiv  Detail & Related papers  (2025-07-08T09:43:17Z)
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens [44.19323180593379]
 Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning.
Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability.
We present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text.
 arXiv  Detail & Related papers  (2025-06-20T17:59:31Z)
- Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs [43.08776932101172]
 We build a dataset of AI-generated images annotated with bounding boxes and descriptive captions.
We then finetune MLLMs through a multi-stage optimization strategy.
The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws.
 arXiv  Detail & Related papers  (2025-06-08T08:47:44Z)
- Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
 We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective.
Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
 arXiv  Detail & Related papers  (2025-06-05T02:28:07Z)
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
 DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning.
We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories.
DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
 arXiv  Detail & Related papers  (2025-05-20T13:48:11Z)
- CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography [12.305953690308085]
 Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence.
Recent advancements, including reasoning models such as OpenAI o1 and Gemini 2.0 Flash Thinking, have opened this capability.
We focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics interplay with the camera parameters.
 arXiv  Detail & Related papers  (2025-04-14T10:53:44Z)
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation [90.00687889213991]
 Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities.
Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems.
In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
 arXiv  Detail & Related papers  (2025-02-23T20:42:15Z)
- Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models [27.78471707423076]
 We propose a new visual reasoning paradigm that enables MLLMs to autonomously modify the input scene into new ones based on their reasoning status.
We introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform.
We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement.
 arXiv  Detail & Related papers  (2024-11-27T08:44:25Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
 We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
 arXiv  Detail & Related papers  (2024-10-14T13:35:47Z)
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
 We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance.
 arXiv  Detail & Related papers  (2024-05-15T21:55:31Z)
- What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models [50.97705264224828]
 We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models.
We aim for the models to engage with and generate responses that span a wider contextual scene understanding.
Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
 arXiv  Detail & Related papers  (2024-03-20T11:27:20Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
 We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
 arXiv  Detail & Related papers  (2024-03-05T18:08:45Z)
- In-Context Analogical Reasoning with Pre-Trained Language Models [10.344428417489237]
 We explore the use of intuitive language-based abstractions to support analogy in AI systems.
Specifically, we apply large pre-trained language models (PLMs) to visual Raven's Progressive Matrices (RPM).
We find that PLMs exhibit a striking capacity for zero-shot relational reasoning, exceeding human performance and nearing supervised vision-based methods.
 arXiv  Detail & Related papers  (2023-05-28T04:22:26Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
 We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
 arXiv  Detail & Related papers  (2023-01-12T18:59:50Z)
- ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation [53.56628907030751]
 We propose ImaginE, an imagination-based automatic evaluation metric for natural language generation.
With the help of CLIP and DALL-E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for the text snippet.
Experiments spanning several text generation tasks demonstrate that adding imagination with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation.
 arXiv  Detail & Related papers  (2021-06-10T17:59:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.