What Makes a Maze Look Like a Maze?
- URL: http://arxiv.org/abs/2409.08202v1
- Date: Thu, 12 Sep 2024 16:41:47 GMT
- Title: What Makes a Maze Look Like a Maze?
- Authors: Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu,
- Abstract summary: We introduce Deep Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning.
At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols.
We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
- Score: 92.80800000328277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
Related papers
- Do Vision-Language Models Really Understand Visual Language? [43.893398898373995]
Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image.
Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams.
This paper develops a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs.
arXiv Detail & Related papers (2024-09-30T19:45:11Z) - Text-to-Image Generation for Abstract Concepts [76.32278151607763]
We propose a framework of Text-to-Image generation for Abstract Concepts (TIAC)
The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity.
The concept-dependent form is retrieved from an LLM-extracted form pattern set.
arXiv Detail & Related papers (2023-09-26T02:22:39Z) - Abstract Visual Reasoning with Tangram Shapes [16.51170712669011]
KiloGram is a resource for studying abstract visual reasoning in humans and machines.
It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels.
We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models.
arXiv Detail & Related papers (2022-11-29T18:57:06Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - PTR: A Benchmark for Part-based Conceptual, Relational, and Physical
Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z) - Constellation: Learning relational abstractions over objects for
compositional imagination [64.99658940906917]
We introduce Constellation, a network that learns relational abstractions of static visual scenes.
This work is a first step in the explicit representation of visual relationships and using them for complex cognitive procedures.
arXiv Detail & Related papers (2021-07-23T11:59:40Z) - Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs [106.15931418425906]
We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present RationaleVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
arXiv Detail & Related papers (2020-10-15T05:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.