Disentangling What and Where for 3D Object-Centric Representations Through Active Inference
- URL: http://arxiv.org/abs/2108.11762v1
- Date: Thu, 26 Aug 2021 12:49:07 GMT
- Title: Disentangling What and Where for 3D Object-Centric Representations Through Active Inference
- Authors: Toon Van de Maele, Tim Verbelen, Ozan Catal and Bart Dhoedt
- Abstract summary: We propose an active inference agent that can learn novel object categories over time.
We show that our agent is able to learn representations for many object categories in an unsupervised way.
We validate our system in an end-to-end fashion where the agent is able to search for an object at a given pose from a pixel-based rendering.
- Score: 4.088019409160893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although modern object detection and classification models achieve high
accuracy, they are typically trained in advance on a fixed training set and
are therefore not flexible enough to handle novel, unseen object categories.
Moreover, these models most often operate on a single frame, which may yield
incorrect classifications under ambiguous viewpoints. In this paper, we
propose an active inference agent that actively gathers evidence for object
classifications, and can learn novel object categories over time. Drawing
inspiration from the human brain, we build object-centric generative models
composed of two information streams, a what- and a where-stream. The
what-stream predicts whether the observed object belongs to a specific
category, while the where-stream is responsible for representing the object in
its internal 3D reference frame. We show that our agent (i) is able to learn
representations for many object categories in an unsupervised way, (ii)
achieves state-of-the-art classification accuracies, actively resolving
ambiguity when required, and (iii) identifies novel object categories.
Furthermore, we validate our system in an end-to-end fashion where the agent is
able to search for an object at a given pose from a pixel-based rendering. We
believe that this is a first step towards building modular, intelligent systems
that can be used for a wide range of tasks involving three-dimensional objects.
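To make the two-stream design concrete, below is a minimal PyTorch sketch, assuming simplified interfaces: a what-stream that maps an observation embedding to a belief over known categories, a where-stream that regresses the object's pose in its internal 3D reference frame, and a view-selection step that prefers the candidate viewpoint expected to reduce categorical uncertainty the most. All names, layer sizes, and the information-gain heuristic are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WhatStream(nn.Module):
    """Hypothetical what-stream: observation embedding -> categorical belief."""
    def __init__(self, embed_dim: int, n_categories: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, n_categories),
        )

    def forward(self, obs_embedding: torch.Tensor) -> torch.Tensor:
        # Belief over known object categories given the current observation.
        return F.softmax(self.head(obs_embedding), dim=-1)

class WhereStream(nn.Module):
    """Hypothetical where-stream: embedding -> pose in the object's 3D frame."""
    def __init__(self, embed_dim: int, pose_dim: int = 7):  # xyz + quaternion
        super().__init__()
        self.head = nn.Linear(embed_dim, pose_dim)

    def forward(self, obs_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(obs_embedding)

def entropy(p: torch.Tensor) -> torch.Tensor:
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)

def pick_next_viewpoint(what: WhatStream,
                        imagined_embeddings: torch.Tensor,
                        current_belief: torch.Tensor) -> int:
    """Active-inference-flavoured view selection: choose the candidate view
    whose imagined observation yields the largest drop in category entropy
    (a crude stand-in for expected information gain)."""
    gains = []
    for emb in imagined_embeddings:  # one imagined embedding per viewpoint
        posterior = what(emb.unsqueeze(0)).squeeze(0)
        gains.append(entropy(current_belief) - entropy(posterior))
    return int(torch.stack(gains).argmax())
```

A low maximum belief across all known categories can additionally serve as a simple trigger for treating an observation as a novel category, in the spirit of point (iii) above.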
Related papers
- Unobserved Object Detection using Generative Models [17.883297093049787]
This study introduces the novel task of 2D and 3D unobserved object detection for predicting the location of objects that are occluded or lie outside the image frame.
We adapt several state-of-the-art pre-trained generative models to solve this task, including 2D and 3D diffusion models and vision-language models.
arXiv Detail & Related papers (2024-10-08T09:57:14Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to specify novel categories: via language descriptions, image exemplars, or a combination of the two (see the sketch after this entry).
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
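As a rough illustration of the multi-modal classifier idea above (a sketch under stated assumptions, not this paper's exact method): per-class weight vectors can be built from text-description embeddings, image-exemplar embeddings, or their average, and detector region features scored by cosine similarity. The function names and averaging fusion are hypothetical.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def build_classifier(text_emb: Optional[torch.Tensor],
                     exemplar_emb: Optional[torch.Tensor]) -> torch.Tensor:
    """Fuse per-class embeddings from language descriptions and/or image
    exemplars, each shaped (n_classes, d); fusion here is a simple average."""
    parts = [e for e in (text_emb, exemplar_emb) if e is not None]
    return F.normalize(torch.stack(parts).mean(dim=0), dim=-1)

def classify_regions(region_features: torch.Tensor,
                     class_weights: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits (n_regions, n_classes) for second-stage heads."""
    return F.normalize(region_features, dim=-1) @ class_weights.T
```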
- You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [26.866356430469757]
We present a method for achieving category-level pose estimation by inspection of just a single object from a desired category.
We demonstrate that our method runs in real-time, enabling a robot manipulator equipped with an RGBD sensor to perform online 6D pose estimation for novel objects.
arXiv Detail & Related papers (2023-05-22T01:32:24Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer an object's category.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Detecting and Accommodating Novel Types and Concepts in an Embodied Simulation Environment [4.507860128918788]
We present methods for two types of metacognitive tasks in an AI system.
We expand a neural classification model to accommodate a new category of object, and recognize when a novel object type is observed instead of misclassifying the observation as a known class.
We present a suite of experiments in rapidly accommodating the introduction of new categories and concepts and in novel type detection, and an architecture to integrate the two in an interactive system (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-11-08T20:55:28Z)
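A minimal sketch of the two metacognitive operations, under simple assumptions: novelty is flagged when the maximum softmax probability over known classes is low, and a new category is accommodated by growing the final linear layer. The threshold value and weight-copying details are illustrative, not the paper's mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def is_novel(logits: torch.Tensor, threshold: float = 0.5) -> bool:
    """Flag an observation as a novel type if no known class is confident."""
    return F.softmax(logits, dim=-1).max().item() < threshold

def add_category(head: nn.Linear) -> nn.Linear:
    """Expand a classification head with one extra, randomly initialised class."""
    new_head = nn.Linear(head.in_features, head.out_features + 1)
    with torch.no_grad():
        new_head.weight[:-1] = head.weight  # keep the learned class weights
        new_head.bias[:-1] = head.bias
    return new_head
```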
- Is an Object-Centric Video Representation Beneficial for Transfer? [86.40870804449737]
We introduce a new object-centric video recognition model built on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
arXiv Detail & Related papers (2022-07-20T17:59:44Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that exploits class semantics not only to generate the features but also to separate them discriminatively (see the sketch after this entry).
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
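An illustrative sketch of the feature-synthesis idea: a generator conditioned on a class-semantic vector plus noise produces visual features for unseen classes, on which a downstream classifier can then be trained. Layer sizes and names are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Maps class semantics (e.g. attribute or word embeddings) plus noise
    to synthetic visual features; architecture sizes are placeholders."""
    def __init__(self, semantic_dim: int, noise_dim: int, feature_dim: int):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(semantic_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feature_dim), nn.ReLU(),
        )

    def forward(self, semantics: torch.Tensor) -> torch.Tensor:
        # Sample one noise vector per class-semantic row, then generate.
        z = torch.randn(semantics.size(0), self.noise_dim,
                        device=semantics.device)
        return self.net(torch.cat([semantics, z], dim=-1))
```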
- Disassembling Object Representations without Labels [75.2215716328001]
We study a new representation-learning task, which we term disassembling object representations.
Disassembling enables category-specific modularity in the learned representations.
We propose an unsupervised approach to achieve disassembling, named Unsupervised Disassembling Object Representation (UDOR).
arXiv Detail & Related papers (2020-04-03T08:23:09Z)
- Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)