Recurrent Attention Models with Object-centric Capsule Representation
for Multi-object Recognition
- URL: http://arxiv.org/abs/2110.04954v1
- Date: Mon, 11 Oct 2021 01:41:21 GMT
- Authors: Hossein Adeli, Seoyoung Ahn, Gregory Zelinsky
- Abstract summary: We show that an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition.
Our work takes a step toward a general architecture for how to integrate recurrent object-centric representation into the planning of attentional glimpses.
- Score: 4.143091738981101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visual system processes a scene using a sequence of selective glimpses,
each driven by spatial and object-based attention. These glimpses reflect what
is relevant to the ongoing task and are selected through recurrent processing
and recognition of the objects in the scene. In contrast, most models treat
attention selection and recognition as separate stages in a feedforward
process. Here we show that using capsule networks to create an object-centric
hidden representation in an encoder-decoder model with iterative glimpse
attention yields effective integration of attention and recognition. We
evaluate our model on three multi-object recognition tasks: highly overlapping
digits, digits among distracting clutter, and house numbers. We show that it
learns to effectively move its glimpse window and to recognize and reconstruct
the objects, all with only classification labels as supervision. Our work takes a step
toward a general architecture for how to integrate recurrent object-centric
representation into the planning of attentional glimpses.
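The loop the abstract describes — extract a glimpse, integrate it into a recurrent hidden state, and choose the next fixation — can be sketched minimally. The following NumPy illustration is an assumption-laden toy, not the paper's capsule-based method: the sizes, weight matrices, and the "brightest pixel" glimpse policy are all hypothetical stand-ins.

```python
import numpy as np

def extract_glimpse(image, center, size):
    """Crop a square glimpse of side `size` around `center` (row, col),
    clipping at the image border and zero-padding the clipped region."""
    r, c = center
    half = size // 2
    r0, c0 = max(0, r - half), max(0, c - half)
    patch = image[r0:r0 + size, c0:c0 + size]
    out = np.zeros((size, size))
    out[:patch.shape[0], :patch.shape[1]] = patch
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Toy recurrent loop: the hidden state integrates successive glimpses;
# the next fixation policy here (brightest pixel) is purely illustrative.
hidden = np.zeros(16)
W = rng.standard_normal((16, 49)) * 0.1   # glimpse (7x7, flattened) -> hidden
U = rng.standard_normal((16, 16)) * 0.1   # hidden -> hidden recurrence
center = (14, 14)
for step in range(3):
    g = extract_glimpse(image, center, 7).ravel()
    hidden = np.tanh(W @ g + U @ hidden)                 # recurrent integration
    center = np.unravel_index(image.argmax(), image.shape)  # toy glimpse policy
```

In the actual model the hidden representation is object-centric (capsules) and the glimpse policy is learned end-to-end from classification supervision; the sketch only shows the recurrent glimpse-integration skeleton.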
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object categories.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Understanding Self-Supervised Pretraining with Part-Aware Representation Learning [88.45460880824376]
We study whether self-supervised representation pretraining methods learn part-aware representations.
Results show that the fully-supervised model outperforms self-supervised models for object-level recognition.
arXiv Detail & Related papers (2023-01-27T18:58:42Z)
- Reconstruction-guided attention improves the robustness and shape processing of neural networks [5.156484100374057]
We build an iterative encoder-decoder network that generates an object reconstruction and uses it as top-down attentional feedback.
Our model shows strong generalization performance against various image perturbations.
Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism.
arXiv Detail & Related papers (2022-09-27T18:32:22Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
- Revisiting spatio-temporal layouts for compositional action recognition [63.04778884595353]
We take an object-centric approach to action recognition.
The main focus of this paper is compositional/few-shot action recognition.
We demonstrate how to improve the performance of appearance-based models by fusion with layout-based models.
arXiv Detail & Related papers (2021-11-02T23:04:39Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- Object-Centric Learning with Slot Attention [43.684193749891506]
We present the Slot Attention module, an architectural component that interfaces with perceptual representations.
Slot Attention produces task-dependent abstract representations which we call slots.
We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions.
arXiv Detail & Related papers (2020-06-26T15:31:57Z)
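The Slot Attention entry above describes an iterative scheme in which slots compete for input features through attention normalized over the slots. A minimal NumPy sketch of that core update follows; the GRU and MLP refinements of the published module are omitted, and the slot count, dimensions, and initialization are illustrative assumptions.

```python
import numpy as np

def slot_attention(inputs, num_slots=3, iters=3, seed=0):
    """Simplified Slot Attention step: softmax over slots makes slots
    compete for each input feature; each slot is then replaced by the
    attention-weighted mean of the inputs (no GRU/MLP refinement)."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.standard_normal((num_slots, d))   # random slot initialization
    for _ in range(iters):
        logits = slots @ inputs.T / np.sqrt(d)                      # (slots, inputs)
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn = attn / attn.sum(axis=0, keepdims=True)               # softmax over slots
        attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)      # per-slot mean weights
        slots = attn @ inputs                                       # weighted-mean update
    return slots
```

Normalizing the softmax over the slot axis (rather than over the inputs, as in standard attention) is what forces the slots to partition the input features among themselves, which is the mechanism behind the object-centric decomposition.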
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.