Related papers: Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames

URL: http://arxiv.org/abs/2302.04973v2
Date: Fri, 21 Jul 2023 01:40:31 GMT
Title: Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames
Authors: Ondrej Biza, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gamaleldin F. Elsayed, Aravindh Mahendran and Thomas Kipf
Abstract summary: Slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress. We present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We evaluate our method on a range of synthetic object discovery benchmarks, namely CLEVR, Tetrominoes, CLEVRTex, Objects Room and MultiShapeNet.
Score: 18.84636947819183
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks namely CLEVR, Tetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
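
The abstract's core idea, expressing position encodings relative to each slot by translating, scaling, and rotating them, can be illustrated with a small sketch. This is not the authors' implementation: the function names (make_grid, to_slot_frame) and the choice to pass each slot's position, scale, and rotation angle in directly are assumptions made for illustration, and the sketch does not cover how the model estimates those quantities.

```python
import math

def make_grid(height, width):
    """Absolute (y, x) coordinates in [-1, 1] for every pixel, row-major."""
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(y, x) for y in axis(height) for x in axis(width)]

def to_slot_frame(points, position, scale, angle):
    """Map absolute (y, x) points into one slot's reference frame.

    Hypothetical helper: subtracts the slot position (translation),
    applies the inverse of the slot rotation, then divides by the
    per-axis slot scale.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    relative = []
    for y, x in points:
        dy, dx = y - position[0], x - position[1]
        # Inverse of the rotation [[cos, -sin], [sin, cos]].
        ry = cos_a * dy + sin_a * dx
        rx = -sin_a * dy + cos_a * dx
        relative.append((ry / scale[0], rx / scale[1]))
    return relative

# Usage: a point and a slot shifted by the same offset yield the same
# relative coordinates, which is the translation invariance the
# slot-centric frame is meant to provide.
point, slot_pos = (0.3, -0.1), (0.2, 0.0)
shift = (0.25, -0.4)
a = to_slot_frame([point], slot_pos, (0.5, 0.5), 0.3)[0]
b = to_slot_frame(
    [(point[0] + shift[0], point[1] + shift[1])],
    (slot_pos[0] + shift[0], slot_pos[1] + shift[1]),
    (0.5, 0.5),
    0.3,
)[0]
```

Under these assumptions, the relative coordinates in a and b agree up to floating-point error, so an encoding computed from them does not change when an object and its slot frame move together.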

Related papers

Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance [0.9786690381850356]
This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness. We advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects.
arXiv Detail & Related papers (2025-10-18T10:42:30Z)
RDD: Robust Feature Detector and Descriptor using Deformable Transformer [8.01082121187363]
We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor. We observed that deformable attention focuses on key locations, effectively reducing the search space complexity. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks.
arXiv Detail & Related papers (2025-05-12T19:24:45Z)
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning [51.170479006249195]
We introduce a new dataset, benchmark, and a dynamic coarse-to-fine learning scheme in this study. Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets. We present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches.
arXiv Detail & Related papers (2024-12-16T09:14:32Z)
Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
arXiv Detail & Related papers (2024-10-04T05:48:02Z)
Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture. We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization. We introduce a benchmark comprising eight different synthetic and real-world datasets. We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
Mesh Denoising Transformer [104.5404564075393]
Mesh denoising aims to remove noise from input meshes while preserving their feature structures. SurfaceFormer is a pioneering Transformer-based mesh denoising framework. A new representation, the Local Surface Descriptor, captures local geometric intricacies. A Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation.
arXiv Detail & Related papers (2024-05-10T15:27:43Z)
SOAR: Advancements in Small Body Object Detection for Aerial Imagery Using State Space Models and Programmable Gradients [0.8873228457453465]
Small object detection in aerial imagery presents significant challenges in computer vision. Traditional methods using transformer-based models often face limitations stemming from the lack of specialized databases. This paper introduces two innovative approaches that significantly enhance detection and segmentation capabilities for small aerial objects.
arXiv Detail & Related papers (2024-05-02T19:47:08Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Rotating Features for Object Discovery [74.1465486264609]
We present Rotating Features, a generalization of complex-valued features to higher dimensions, and a new evaluation procedure for extracting objects from distributed representations. Together, these advancements enable us to scale distributed object-centric representations from simple toy to real-world data.
arXiv Detail & Related papers (2023-06-01T12:16:26Z)
Persistent Homology Meets Object Unity: Object Recognition in Clutter [2.356908851188234]
Recognition of occluded objects in unseen and unstructured indoor environments is a challenging problem for mobile robots. We propose a new descriptor, TOPS, for point clouds generated from depth images and an accompanying recognition framework, THOR, inspired by human reasoning. THOR outperforms state-of-the-art methods on both datasets and achieves substantially higher recognition accuracy for all the scenarios of the UW-IS Occluded dataset.
arXiv Detail & Related papers (2023-05-05T19:42:39Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves unsupervised object discovery performance competitive with a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails, all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.