Learning What and Where -- Unsupervised Disentangling Location and
Identity Tracking
- URL: http://arxiv.org/abs/2205.13349v1
- Date: Thu, 26 May 2022 13:30:14 GMT
- Title: Learning What and Where -- Unsupervised Disentangling Location and
Identity Tracking
- Authors: Manuel Traub, Sebastian Otte, Tobias Menge, Matthias Karlbauer, Jannik
Thümmel, Martin V. Butz
- Abstract summary: We introduce an unsupervised disentangled LOCation and Identity tracking system (Loci).
Inspired by the dorsal-ventral pathways in the brain, Loci tackles the what-and-where binding problem by means of a self-supervised segregation mechanism.
Loci may set the stage for deeper, explanation-oriented video processing.
- Score: 0.44040106718326594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our brain can almost effortlessly decompose visual data streams into
background and salient objects. Moreover, it can track the objects and
anticipate their motion and interactions. In contrast, recent object reasoning
datasets, such as CATER, have revealed fundamental shortcomings of current
vision-based AI systems, particularly when targeting explicit object encodings,
object permanence, and object reasoning. We introduce an unsupervised
disentangled LOCation and Identity tracking system (Loci), which excels on the
CATER tracking challenge. Inspired by the dorsal-ventral pathways in the brain,
Loci tackles the what-and-where binding problem by means of a self-supervised
segregation mechanism. Our autoregressive neural network partitions and
distributes the visual input stream across separate, identically-parameterized
and autonomously recruited neural network modules. Each module binds what with
where, that is, compressed Gestalt encodings with locations. On the deep latent
encoding levels interaction dynamics are processed. Besides exhibiting superior
performance in current benchmarks, we propose that Loci may set the stage for
deeper, explanation-oriented video processing -- akin to some deeper networked
processes in the brain that appear to integrate individual entity and
spatiotemporal interaction dynamics into event structures.
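The abstract's core idea, identically-parameterized modules that each bind a compressed "what" (Gestalt) code to a "where" (location) estimate, can be illustrated with a minimal sketch. All names, shapes, and the linear maps below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of Loci-style what/where binding.
# Shapes, weights, and the partitioning are stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

class SlotModule:
    """One module: binds a compressed 'what' (Gestalt) code
    to a 'where' (location) estimate for its input slice."""
    def __init__(self, shared_weights):
        # Every module shares the same parameters, as in Loci.
        self.W_what, self.W_where = shared_weights

    def __call__(self, x):
        what = np.tanh(self.W_what @ x)   # compressed identity code
        where = self.W_where @ x          # 2-D location estimate
        return what, where

in_dim, what_dim, n_slots = 16, 4, 3
shared = (rng.standard_normal((what_dim, in_dim)) * 0.1,
          rng.standard_normal((2, in_dim)) * 0.1)
slots = [SlotModule(shared) for _ in range(n_slots)]

# The visual input is partitioned across the modules
# (random vectors stand in for the segregated input stream).
frame_parts = [rng.standard_normal(in_dim) for _ in range(n_slots)]
outputs = [slot(x) for slot, x in zip(slots, frame_parts)]
for i, (what, where) in enumerate(outputs):
    print(f"slot {i}: what shape {what.shape}, where shape {where.shape}")
```

In the actual system the modules are autonomously recruited and autoregressive over time; this sketch only shows the per-frame binding of an identity code with a location.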
Related papers
- Tracking objects that change in appearance with phase synchrony [14.784044408031098]
We show that a novel deep learning circuit can learn to control attention to features separately from their location in the world through neural synchrony.
We compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs) using FeatureTracker: a large-scale challenge.
Our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization.
arXiv Detail & Related papers (2024-10-02T23:30:05Z) - Connectivity-Inspired Network for Context-Aware Recognition [1.049712834719005]
We focus on the effect of incorporating circuit motifs found in biological brains to address visual recognition.
Our convolutional architecture is inspired by the connectivity of human cortical and subcortical streams.
We present a new plug-and-play module to model context awareness.
arXiv Detail & Related papers (2024-09-06T15:42:10Z) - The Dynamic Net Architecture: Learning Robust and Holistic Visual Representations Through Self-Organizing Networks [3.9848584845601014]
We present a novel intelligent-system architecture called the "Dynamic Net Architecture" (DNA).
DNA relies on recurrence-stabilized networks, and we discuss it in application to vision.
arXiv Detail & Related papers (2024-07-08T06:22:10Z) - Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that works in different ways in the training and inference processes.
Our model achieves SOTA performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z) - BI-AVAN: Brain-inspired Adversarial Visual Attention Network [67.05560966998559]
We propose a brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity.
Our model imitates the biased competition process between attended and neglected objects to identify and locate, in an unsupervised manner, the visual objects in a movie frame that the human brain focuses on.
arXiv Detail & Related papers (2022-10-27T22:20:36Z) - Bi-directional Object-context Prioritization Learning for Saliency
Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z) - The Challenge of Appearance-Free Object Tracking with Feedforward Neural
Networks [12.081808043723937]
PathTracker tests the ability of observers to learn to track objects solely by their motion.
We find that standard 3D-convolutional deep network models struggle to solve this task.
Strategies for appearance-free object tracking from biological vision can inspire solutions.
arXiv Detail & Related papers (2021-09-30T17:58:53Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Attentional Separation-and-Aggregation Network for Self-supervised
Depth-Pose Learning in Dynamic Scenes [19.704284616226552]
Learning depth and ego-motion from unlabeled videos via self-supervision from epipolar projection can improve the robustness and accuracy of the 3D perception and localization of vision-based robots.
However, the rigid projection computed by ego-motion cannot represent all scene points, such as points on moving objects, leading to false guidance in these regions.
We propose an Attentional Separation-and-Aggregation Network (ASANet) which can learn to distinguish and extract the scene's static and dynamic characteristics via the attention mechanism.
arXiv Detail & Related papers (2020-11-18T16:07:30Z) - Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z) - Ventral-Dorsal Neural Networks: Object Detection via Selective Attention [51.79577908317031]
We propose a new framework called Ventral-Dorsal Networks (VDNets)
Inspired by the structure of the human visual system, we propose the integration of a "Ventral Network" and a "Dorsal Network"
Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches.
arXiv Detail & Related papers (2020-05-15T23:57:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.