Related papers: Top-Down Guidance for Learning Object-Centric Representations

URL: http://arxiv.org/abs/2405.10598v3
Date: Sun, 24 Aug 2025 12:59:48 GMT
Title: Top-Down Guidance for Learning Object-Centric Representations
Authors: Junhong Zou, Xiangyu Zhu, Zhaoxiang Zhang, Zhen Lei
Abstract summary: Top-Down Guided Network (TDGNet) is a top-down pathway to improve object-centric representations. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity.
Score: 30.06924788022504
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans' innate ability to decompose scenes into objects allows for efficient understanding, predicting, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with similar capabilities, learning to represent scenes as compositions of objects. However, existing OCL models only learn through reconstructing the input images, which does not assist the model in distinguishing objects, resulting in suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on humans' top-down vision pathway and propose Top-Down Guided Network (TDGNet), which includes a top-down pathway to improve object-centric representations. During training, the top-down pathway constructs guidance with high-level object-centric representations to optimize low-level grid features output by the backbone. During inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we expand the downstream task scope of object-centric representations by applying TDGNet to the field of robotics, validating its effectiveness in downstream tasks including video prediction and visual planning.

Related papers

Shortcut Learning Susceptibility in Vision Classifiers [3.004632712148892]
Shortcut learning is where machine learning models exploit spurious correlations in data instead of capturing meaningful features. This phenomenon is prevalent across various machine learning applications, including vision, natural language processing, and speech recognition. We systematically evaluate these architectures by introducing deliberate shortcuts into the dataset that are positionally correlated with class labels.
arXiv Detail & Related papers (2025-02-13T10:25:52Z)
EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation [30.93060152004132]
Learning to manipulate objects from high-dimensional observations presents significant challenges. Recent approaches have utilized large-scale offline data to train models from pixel observations. We propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer.
arXiv Detail & Related papers (2024-12-25T13:50:15Z)
Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning [51.170479006249195]
We introduce a new dataset, benchmark, and a dynamic coarse-to-fine learning scheme in this study. Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets. We present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches.
arXiv Detail & Related papers (2024-12-16T09:14:32Z)
Bootstrapping Top-down Information for Self-modulating Slot Attention [29.82550058869251]
We propose a novel OCL framework incorporating a top-down pathway. This pathway bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
arXiv Detail & Related papers (2024-11-04T05:00:49Z)
Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization. We introduce a benchmark comprising eight different synthetic and real-world datasets. We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
See Through Their Minds: Learning Transferable Neural Representation from Cross-Subject fMRI [32.40827290083577]
Deciphering visual content from functional Magnetic Resonance Imaging (fMRI) helps illuminate the human vision system. Previous approaches primarily employ subject-specific models, sensitive to training sample size. We propose shallow subject-specific adapters to map cross-subject fMRI data into unified representations. During training, we leverage both visual and textual supervision for multi-modal brain decoding.
arXiv Detail & Related papers (2024-03-11T01:18:49Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Multi-Object Navigation with dynamically learned neural implicit representations [10.182418917501064]
We propose to structure neural networks with two neural implicit representations, which are learned dynamically during each episode. We evaluate the agent on Multi-Object Navigation and show the high impact of using neural implicit representations as a memory source.
arXiv Detail & Related papers (2022-10-11T04:06:34Z)
Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination based contrastive learning (CL). In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves unsupervised object discovery performance competitive with a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails, all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
Compositional Scene Representation Learning via Reconstruction: A Survey [48.33349317481124]
Compositional scene representation learning is a task that enables such abilities. Deep neural networks have been proven to be advantageous in representation learning. Learning via reconstruction is advantageous because it may utilize massive unlabeled data and avoid costly and laborious data annotation.
arXiv Detail & Related papers (2022-02-15T02:14:05Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances. We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)
PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context. We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)
Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. Our framework can be trained without the help of any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks. First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts. Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z)
Relevance-Guided Modeling of Object Dynamics for Reinforcement Learning [0.0951828574518325]
Current deep reinforcement learning (RL) approaches incorporate minimal prior knowledge about the environment. We propose a framework for reasoning about object dynamics and behavior to rapidly determine minimal and task-specific object representations. We also highlight the potential of this framework on several Atari games, using our object representation and standard RL and planning algorithms to learn dramatically faster than existing deep RL algorithms.
arXiv Detail & Related papers (2020-03-03T08:18:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.