AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition
- URL: http://arxiv.org/abs/2110.05836v1
- Date: Tue, 12 Oct 2021 08:59:19 GMT
- Title: AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition
- Authors: Arijit Dasgupta, Jiafei Duan, Marcelo H. Ang Jr, Cheston Tan
- Abstract summary: Violation-of-Expectation (VoE) is used to evaluate models' ability to discriminate between expected and surprising scenes.
Existing VoE-based 3D datasets in physical reasoning only provide vision data.
We propose AVoE: a synthetic 3D VoE-based dataset that presents stimuli from multiple novel sub-categories for five event categories of physical reasoning.
- Score: 2.561649173827544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work in cognitive reasoning and computer vision has made the
Violation-of-Expectation (VoE) paradigm increasingly popular in synthetic
datasets. Inspired by work in infant psychology, researchers have
started evaluating a model's ability to discriminate between expected and
surprising scenes as a sign of its reasoning ability. Existing VoE-based 3D
datasets in physical reasoning only provide vision data. However, current
cognitive models of physical reasoning by psychologists reveal that infants create
high-level abstract representations of objects and interactions. Capitalizing
on this knowledge, we propose AVoE: a synthetic 3D VoE-based dataset that
presents stimuli from multiple novel sub-categories for five event categories
of physical reasoning. Compared to existing work, AVoE augments the vision data
with ground-truth labels of abstract features and rules, paving the way for
high-level symbolic predictions in physical reasoning tasks.
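The following sketch (illustrative only, not code released with the paper) shows how a VoE-style evaluation is commonly scored: a model assigns a surprise score to each scene, and a paired expected/surprising stimulus counts as correct when the surprising scene receives the higher score. The VoEPair container and the surprise callable are hypothetical names introduced here for clarity.

    # Minimal sketch of relative VoE scoring (assumed setup, not the AVoE pipeline).
    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class VoEPair:
        event_category: str   # e.g. a physical-reasoning event such as support or occlusion
        expected: Sequence    # frames of the physically plausible scene
        surprising: Sequence  # frames of the physically implausible scene

    def voe_accuracy(pairs: Sequence[VoEPair],
                     surprise: Callable[[Sequence], float]) -> float:
        # A pair counts as correct when the surprising scene scores higher.
        correct = sum(surprise(p.surprising) > surprise(p.expected) for p in pairs)
        return correct / max(len(pairs), 1)

Under relative scoring of this kind, AVoE's ground-truth abstract features and rules would additionally let a model be probed on which rule a surprising scene violates, rather than only on the binary expected/surprising judgment.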
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Learning 3D object-centric representation through prediction [12.008668555280668]
We develop a novel network architecture that learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth.
The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes.
arXiv Detail & Related papers (2024-03-06T14:19:11Z)
- Visual cognition in multimodal large language models [12.603212933816206]
Recent advances in multimodal large language models have rekindled interest in their potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z)
- X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events [75.94926117990435]
This study introduces X-VoE, a benchmark dataset to assess AI agents' grasp of intuitive physics.
X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models.
We present an explanation-based learning system that captures physics dynamics and infers occluded object states.
arXiv Detail & Related papers (2023-08-21T03:28:23Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
A new task called Embodied Reference Understanding (ERU) is first designed for this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- Objaverse: A Universe of Annotated 3D Objects [53.2537614157313]
We present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive tags, captions, and animations.
We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models.
arXiv Detail & Related papers (2022-12-15T18:56:53Z)
- PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z)
- A Benchmark for Modeling Violation-of-Expectation in Physical Reasoning Across Event Categories [4.4920673251997885]
Violation-of-Expectation (VoE) is used to label scenes as either expected or surprising with knowledge of only expected scenes.
Existing VoE-based 3D datasets in physical reasoning provide mainly vision data with little to no ground-truths or inductive biases.
We set up a benchmark to study physical reasoning by curating a novel large-scale synthetic 3D VoE dataset armed with ground-truth labels of causally relevant features and rules.
arXiv Detail & Related papers (2021-11-16T22:59:25Z)
- Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects.
Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input.
We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding [33.68455617113953]
We present a 3D AffordanceNet dataset, a benchmark of 23k shapes from 23 semantic object categories, annotated with 18 visual affordance categories.
Three state-of-the-art point cloud deep learning networks are evaluated on all tasks.
arXiv Detail & Related papers (2021-03-30T14:46:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.