Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
- URL: http://arxiv.org/abs/2303.10482v1
- Date: Sat, 18 Mar 2023 19:37:28 GMT
- Title: Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
- Authors: Shi Chen and Qi Zhao
- Abstract summary: We propose an integral framework consisting of a principled object factorization method and a novel neural module network.
Our factorization method decomposes objects based on their key characteristics, and automatically derives prototypes that represent a wide range of objects.
With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity in a common semantic space.
It can answer questions about diverse objects regardless of their availability during training, and overcomes the issues of biased question-answer distributions.
- Score: 30.392986232906107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans have the innate capability to answer diverse questions, which is
rooted in the natural ability to correlate different concepts based on their
semantic relationships and decompose difficult problems into sub-tasks. In
contrast, existing visual reasoning methods assume training samples that
capture every possible object and reasoning problem, and rely on black-box
models that commonly exploit statistical priors. They have yet to develop the
capability to address novel objects or spurious biases in real-world scenarios,
and also fall short of interpreting the rationales behind their decisions.
Inspired by humans' reasoning of the visual world, we tackle the aforementioned
challenges from a compositional perspective, and propose an integral framework
consisting of a principled object factorization method and a novel neural
module network. Our factorization method decomposes objects based on their key
characteristics, and automatically derives prototypes that represent a wide
range of objects. With these prototypes encoding important semantics, the
proposed network then correlates objects by measuring their similarity in a
common semantic space and makes decisions with a compositional reasoning
process. It can answer questions about diverse objects regardless of their
availability during training, and overcomes the issues of biased
question-answer distributions. In addition to the enhanced generalizability,
our framework also provides an interpretable interface for understanding the
decision-making process of models. Our code is available at
https://github.com/szzexpoi/POEM.
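The core mechanism described above, correlating objects by measuring their similarity to learned prototypes in a common semantic space, can be sketched in a few lines. This is a minimal reading of the abstract, not the actual POEM implementation; the class name, layer choices, and dimensions are illustrative assumptions.
```python
# Minimal sketch of the prototype-similarity idea from the abstract above.
# Not the POEM implementation: names, dimensions, and the projection layer
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeSimilarity(nn.Module):
    """Scores detected objects against learned prototypes in a shared space."""

    def __init__(self, feat_dim: int, proto_dim: int, num_prototypes: int):
        super().__init__()
        self.project = nn.Linear(feat_dim, proto_dim)  # map object features into the semantic space
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, proto_dim))

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_objects, feat_dim) region features from a detector
        z = F.normalize(self.project(obj_feats), dim=-1)  # (num_objects, proto_dim)
        p = F.normalize(self.prototypes, dim=-1)          # (num_prototypes, proto_dim)
        return z @ p.t()  # cosine similarity of every object to every prototype

# Even an object class unseen during training yields a similarity profile over
# the prototypes, which is what lets downstream reasoning modules handle it.
sim = PrototypeSimilarity(feat_dim=2048, proto_dim=256, num_prototypes=64)
scores = sim(torch.randn(10, 2048))  # -> (10, 64) object-to-prototype similarities
```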
Related papers
- Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? [62.984473889987605]
We present a zero-shot framework for fine-grained visual concept learning by leveraging a large language model (LLM) and a Visual Question Answering (VQA) system.
We pose the questions generated by the LLM, along with the query image, to the VQA system and aggregate the answers to determine the presence or absence of an object in the test images.
Our experiments demonstrate comparable performance with existing zero-shot visual classification methods and few-shot concept learning approaches.
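A hedged sketch of the aggregation step just described: the helper below assumes a hypothetical `ask_vqa(image, question)` callable returning a free-form answer; neither that interface nor the majority-vote threshold comes from the paper.
```python
# Hedged sketch of the question-aggregation step described above. The
# ask_vqa interface and the majority-vote rule are assumptions, not the
# paper's actual API.
from typing import Callable, List

def concept_present(image: object, questions: List[str],
                    ask_vqa: Callable[[object, str], str],
                    threshold: float = 0.5) -> bool:
    """Decide presence of a concept by aggregating yes/no VQA answers."""
    yes_votes = sum(ask_vqa(image, q).strip().lower().startswith("yes")
                    for q in questions)
    return yes_votes / len(questions) >= threshold  # simple majority vote

# Usage: the questions would come from prompting an LLM with the concept name,
# e.g. ["Does the animal have webbed feet?", "Does it have a flat beak?"].
```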
arXiv Detail & Related papers (2024-10-17T15:16:10Z)
- Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing [10.206921909332006]
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate.
In this work, we investigate how transformers behave on unseen compositional tasks.
arXiv Detail & Related papers (2024-05-08T20:23:24Z)
- Even-if Explanations: Formal Foundations, Priorities and Complexity [18.126159829450028]
We show that both linear and tree-based models are strictly more interpretable than neural networks.
We introduce a preference-based framework that enables users to personalize explanations based on their preferences.
arXiv Detail & Related papers (2024-01-17T11:38:58Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Rotating Features for Object Discovery [74.1465486264609]
We present Rotating Features, a generalization of complex-valued features to higher dimensions, and a new evaluation procedure for extracting objects from distributed representations.
Together, these advancements enable us to scale distributed object-centric representations from simple toy to real-world data.
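A rough sketch of the binding idea as it reads from this summary: each feature is a vector in R^n whose norm encodes activation strength and whose direction encodes object assignment, so objects can be recovered by clustering orientations. The activation threshold and the spherical k-means step below are my assumptions, not the paper's exact evaluation procedure.
```python
# Rough sketch of binding via feature orientations, per the summary above.
# The activation threshold and spherical k-means are assumptions, not the
# paper's exact evaluation procedure.
import numpy as np

def assign_objects(features: np.ndarray, num_objects: int,
                   active_thresh: float = 0.1, iters: int = 20) -> np.ndarray:
    """features: (num_features, n) rotating features -> object id per feature."""
    norms = np.linalg.norm(features, axis=1)
    dirs = features / np.maximum(norms, 1e-8)[:, None]  # unit orientations
    rng = np.random.default_rng(0)
    centers = dirs[rng.choice(len(dirs), num_objects, replace=False)]
    for _ in range(iters):  # spherical k-means on the unit sphere
        labels = np.argmax(dirs @ centers.T, axis=1)  # nearest center by cosine
        for k in range(num_objects):
            members = dirs[labels == k]
            if len(members):
                mean = members.mean(axis=0)
                centers[k] = mean / (np.linalg.norm(mean) + 1e-8)
    labels[norms < active_thresh] = -1  # weakly activated features -> background
    return labels
```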
arXiv Detail & Related papers (2023-06-01T12:16:26Z)
- Relate to Predict: Towards Task-Independent Knowledge Representations for Reinforcement Learning [11.245432408899092]
Reinforcement learning can enable agents to learn complex tasks, but the resulting knowledge is difficult to interpret and to reuse across tasks.
In this paper, we introduce an inductive bias for explicit object-centered knowledge separation.
We show that the degree of explicitness in knowledge separation correlates with faster learning, better accuracy, better generalization, and better interpretability.
arXiv Detail & Related papers (2022-12-10T13:33:56Z)
- Translational Concept Embedding for Generalized Compositional Zero-shot Learning [73.60639796305415]
Generalized compositional zero-shot learning means learning composed concepts of attribute-object pairs in a zero-shot fashion.
This paper introduces a new approach, termed translational concept embedding, to solve the two key difficulties of this task in a unified framework.
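The name suggests a TransE-style composition in which an attribute acts as a translation vector applied to an object embedding. The sketch below is that reading only; the module and scoring rule are assumptions, not the paper's formulation.
```python
# TransE-style reading of "translational concept embedding": an attribute is
# a translation vector added to an object embedding. Module names and the
# cosine scoring rule are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationalComposer(nn.Module):
    def __init__(self, num_attrs: int, num_objs: int, dim: int):
        super().__init__()
        self.attr = nn.Embedding(num_attrs, dim)  # one translation vector per attribute
        self.obj = nn.Embedding(num_objs, dim)    # one anchor per object class

    def score(self, img_feat: torch.Tensor, attr_ids: torch.Tensor,
              obj_ids: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, dim) visual features projected into the embedding space
        composed = self.obj(obj_ids) + self.attr(attr_ids)      # e.g. "dog" + "wet"
        return F.cosine_similarity(img_feat, composed, dim=-1)  # higher = better match

# Zero-shot compositions come for free: any (attribute, object) pair can be
# composed at test time even if it never co-occurred in training.
```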
arXiv Detail & Related papers (2021-12-20T21:27:51Z)
- PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z)
- Separating Skills and Concepts for Novel Visual Question Answering [66.46070380927372]
Generalization to out-of-distribution data has been a problem for Visual Question Answering (VQA) models.
"Skills" are visual tasks, such as counting or attribute recognition, and are applied to "concepts" mentioned in the question.
We present a novel method for learning to compose skills and concepts that separates these two factors implicitly within a model.
arXiv Detail & Related papers (2021-07-19T18:55:10Z)
- Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
arXiv Detail & Related papers (2021-04-12T02:37:20Z)
- CURI: A Benchmark for Productive Concept Learning Under Uncertainty [33.83721664338612]
We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI).
CURI evaluates different aspects of productive and systematic generalization, including abstract understanding, disentangling, productive generalization, learning operations, and variable binding.
It also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes.
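The summary does not spell out how the gap is computed, so the following is only a plausible formalization: measure, per axis, the drop from accuracy on an i.i.d. split to accuracy on that axis's compositional split.
```python
# Plausible formalization of a per-axis "compositionality gap"; the exact
# reference models CURI uses are not given in the summary above, so this is
# an assumption, not the benchmark's definition.
def compositionality_gap(acc_iid: float, acc_comp: float) -> float:
    """Positive values mean the compositional split is harder than i.i.d."""
    return acc_iid - acc_comp

# Illustration only (made-up numbers, not CURI results):
print(compositionality_gap(acc_iid=0.92, acc_comp=0.61))  # ≈ 0.31
```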
arXiv Detail & Related papers (2020-10-06T16:23:17Z)