Related papers: Learning What and Where -- Unsupervised Disentangling Location and Identity Tracking

Learning What and Where -- Unsupervised Disentangling Location and Identity Tracking

URL: http://arxiv.org/abs/2205.13349v1
Date: Thu, 26 May 2022 13:30:14 GMT
Title: Learning What and Where -- Unsupervised Disentangling Location and Identity Tracking
Authors: Manuel Traub, Sebastian Otte, Tobias Menge, Matthias Karlbauer, Jannik Th\"ummel, Martin V. Butz
Abstract summary: We introduce an unsupervisedd LOCation and Identity tracking system (Loci) Inspired by the dorsal-ventral pathways in the brain, Loci tackles the what-and-where binding problem by means of a self-supervised segregation mechanism. Loci may set the stage for deeper, explanation-oriented video processing.
Score: 0.44040106718326594
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can track the objects and anticipate their motion and interactions. In contrast, recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object encodings, object permanence, and object reasoning. We introduce an unsupervised disentangled LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal-ventral pathways in the brain, Loci tackles the what-and-where binding problem by means of a self-supervised segregation mechanism. Our autoregressive neural network partitions and distributes the visual input stream across separate, identically-parameterized and autonomously recruited neural network modules. Each module binds what with where, that is, compressed Gestalt encodings with locations. On the deep latent encoding levels interaction dynamics are processed. Besides exhibiting superior performance in current benchmarks, we propose that Loci may set the stage for deeper, explanation-oriented video processing -- akin to some deeper networked processes in the brain that appear to integrate individual entity and spatiotemporal interaction dynamics into event structures.

Related papers

Discovering Chunks in Neural Embeddings for Interpretability [53.80157905839065]
We propose leveraging the principle of chunking to interpret artificial neural population activities. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities. We identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts.
arXiv Detail & Related papers (2025-02-03T20:30:46Z)
Tracking objects that change in appearance with phase synchrony [14.784044408031098]
We show that a novel deep learning circuit can learn to control attention to features separately from their location in the world through neural synchrony. We compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs) using FeatureTracker: a large-scale challenge. Our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization.
arXiv Detail & Related papers (2024-10-02T23:30:05Z)
Connectivity-Inspired Network for Context-Aware Recognition [1.049712834719005]
We focus on the effect of incorporating circuit motifs found in biological brains to address visual recognition. Our convolutional architecture is inspired by the connectivity of human cortical and subcortical streams. We present a new plug-and-play module to model context awareness.
arXiv Detail & Related papers (2024-09-06T15:42:10Z)
The Dynamic Net Architecture: Learning Robust and Holistic Visual Representations Through Self-Organizing Networks [3.9848584845601014]
We present a novel intelligent-system architecture called "Dynamic Net Architecture" (DNA) DNA relies on recurrence-stabilized networks and discuss it in application to vision.
arXiv Detail & Related papers (2024-07-08T06:22:10Z)
Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes. RHGNet introduces a top-down pathway that works in different ways in the training and inference processes. Our model achieves SOTA performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z)
Probing neural representations of scene perception in a hippocampally dependent task using artificial neural networks [1.0312968200748116]
Deep artificial neural networks (DNNs) trained through backpropagation provide effective models of the mammalian visual system. We describe a novel scene perception benchmark inspired by a hippocampal dependent task. Using a network architecture inspired by the connectivity between temporal lobe structures and the hippocampus, we demonstrate that DNNs trained using a triplet loss can learn this task.
arXiv Detail & Related papers (2023-03-11T10:26:25Z)
BI AVAN: Brain inspired Adversarial Visual Attention Network [67.05560966998559]
We propose a brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner.
arXiv Detail & Related papers (2022-10-27T22:20:36Z)
Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations. We observe that spatial attention works concurrently with object-based attention in the human visual recognition system. We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
The Challenge of Appearance-Free Object Tracking with Feedforward Neural Networks [12.081808043723937]
$itPathTracker$ tests the ability of observers to learn to track objects solely by their motion. We find that standard 3D-convolutional deep network models struggle to solve this task. strategies for appearance-free object tracking from biological vision can inspire solutions.
arXiv Detail & Related papers (2021-09-30T17:58:53Z)
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
Attentional Separation-and-Aggregation Network for Self-supervised Depth-Pose Learning in Dynamic Scenes [19.704284616226552]
Learning depth and ego-motion from unlabeled videos via self-supervision from epipolar projection can improve the robustness and accuracy of the 3D perception and localization of vision-based robots. However, the rigid projection computed by ego-motion cannot represent all scene points, such as points on moving objects, leading to false guidance in these regions. We propose an Attentional Separation-and-Aggregation Network (ASANet) which can learn to distinguish and extract the scene's static and dynamic characteristics via the attention mechanism.
arXiv Detail & Related papers (2020-11-18T16:07:30Z)
Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks. First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts. Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
Ventral-Dorsal Neural Networks: Object Detection via Selective Attention [51.79577908317031]
We propose a new framework called Ventral-Dorsal Networks (VDNets) Inspired by the structure of the human visual system, we propose the integration of a "Ventral Network" and a "Dorsal Network" Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches.
arXiv Detail & Related papers (2020-05-15T23:57:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.