A Benchmark for Compositional Visual Reasoning
- URL: http://arxiv.org/abs/2206.05379v1
- Date: Sat, 11 Jun 2022 00:04:49 GMT
- Title: A Benchmark for Compositional Visual Reasoning
- Authors: Aimen Zerroug, Mohit Vaishnav, Julien Colin, Sebastian Musslick,
Thomas Serre
- Abstract summary: We introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards more data-efficient learning algorithms.
We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and associated image datasets at scale.
Our proposed benchmark includes measures of sample efficiency, generalization and transfer across task rules, as well as the ability to leverage compositionality.
- Score: 5.576460160219606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental component of human vision is our ability to parse complex
visual scenes and judge the relations between their constituent objects. AI
benchmarks for visual reasoning have driven rapid progress in recent years with
state-of-the-art systems now reaching human accuracy on some of these
benchmarks. Yet, a major gap remains in terms of the sample efficiency with
which humans and AI systems learn new visual reasoning tasks. Humans'
remarkable efficiency at learning has been at least partially attributed to
their ability to harness compositionality -- such that they can efficiently
take advantage of previously gained knowledge when learning new tasks. Here, we
introduce a novel visual reasoning benchmark, Compositional Visual Relations
(CVR), to drive progress towards the development of more data-efficient
learning algorithms. We take inspiration from fluid intelligence and
non-verbal reasoning tests and describe a novel method for creating
compositions of abstract rules and associated image datasets at scale. Our
proposed benchmark includes measures of sample efficiency, generalization and
transfer across task rules, as well as the ability to leverage
compositionality. We systematically evaluate modern neural architectures and
find that, surprisingly, convolutional architectures surpass transformer-based
architectures across all performance measures in most data regimes. However,
all computational models remain far less data-efficient than humans, even
after learning informative visual representations through self-supervision.
Overall, we hope that our challenge will spur interest in the development of
neural architectures that can learn to harness compositionality toward more
efficient learning.
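To make the evaluation protocol concrete, the sketch below illustrates how a sample-efficiency curve could be measured on a CVR-style task: a classifier is retrained from scratch on increasingly large training subsets and evaluated on a held-out set. The network, the synthetic placeholder data, and the subset sizes are illustrative assumptions only and are not taken from the paper or its released code.

```python
# Hypothetical sketch of a sample-efficiency curve for a CVR-style task.
# All names (SmallConvNet, the random stand-in data, the subset sizes)
# are placeholder assumptions, not the benchmark's actual code.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SmallConvNet(nn.Module):
    """Tiny convolutional classifier standing in for the evaluated backbones."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in loader:
            correct += (model(xb).argmax(1) == yb).sum().item()
            total += yb.numel()
    return correct / total

# Synthetic stand-in data: 32x32 RGB "problems" with 4-way rule labels.
x = torch.randn(1200, 3, 32, 32)
y = torch.randint(0, 4, (1200,))
test_loader = DataLoader(TensorDataset(x[800:], y[800:]), batch_size=64)

# Sample-efficiency curve: retrain from scratch on growing training subsets.
loss_fn = nn.CrossEntropyLoss()
for n in [100, 200, 400, 800]:
    model = SmallConvNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    train_loader = DataLoader(TensorDataset(x[:n], y[:n]), batch_size=32, shuffle=True)
    for _ in range(5):  # a few epochs per data regime, for illustration only
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    print(f"n={n:4d}  test acc={accuracy(model, test_loader):.3f}")
```

In the actual benchmark, the random tensors would be replaced by CVR problem images and rule labels, and the same loop could be repeated with self-supervised or transferred initializations to compare data regimes across architectures.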
Related papers
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac leverages intra- and inter-modality losses for learning representations, resulting in enhanced material-property classification and more adept grasping prediction.
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
The long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We propose a transformer-based framework for compositional HOI learning.
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
- Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
- JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on humans' Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging for prior machine reading models as well as for recent large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z)
- AIGenC: An AI generalisation model via creativity [1.933681537640272]
Inspired by cognitive theories of creativity, this paper introduces a computational model (AIGenC) that lays down the necessary components to enable artificial agents to learn, use and generate transferable representations.
We discuss the model's capability to yield better out-of-distribution generalisation in artificial agents.
arXiv Detail & Related papers (2022-05-19T17:43:31Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
- Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.