Divided Attention: Unsupervised Multi-Object Discovery with Contextually
Separated Slots
- URL: http://arxiv.org/abs/2304.01430v2
- Date: Thu, 22 Jun 2023 23:30:10 GMT
- Title: Divided Attention: Unsupervised Multi-Object Discovery with Contextually
Separated Slots
- Authors: Dong Lao, Zhengyang Hu, Francesco Locatello, Yanchao Yang, Stefano
Soatto
- Abstract summary: We introduce a method to segment the visual field into independently moving regions, trained with no ground truth or supervision.
It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention.
- Score: 78.23772771485635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a method to segment the visual field into independently moving
regions, trained with no ground truth or supervision. It consists of an
adversarial conditional encoder-decoder architecture based on Slot Attention,
modified to use the image as context to decode optical flow without attempting
to reconstruct the image itself. In the resulting multi-modal representation,
one modality (flow) feeds the encoder to produce separate latent codes (slots),
whereas the other modality (image) conditions the decoder to generate the first
(flow) from the slots. This design frees the representation from having to
encode complex nuisance variability in the image due to, for instance,
illumination and reflectance properties of the scene. Since customary
autoencoding based on minimizing the reconstruction error does not preclude the
entire flow from being encoded into a single slot, we modify the loss to an
adversarial criterion based on Contextual Information Separation. The resulting
min-max optimization fosters the separation of objects and their assignment to
different attention slots, leading to Divided Attention, or DivA. DivA
outperforms recent unsupervised multi-object motion segmentation methods while
tripling run-time speed up to 104 FPS and reducing the performance gap from
supervised methods to 12% or less. DivA can handle different numbers of objects
and different image sizes at training and test time, is invariant to
permutation of object labels, and does not require explicit regularization.
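To make the data flow described in the abstract concrete, below is a minimal PyTorch sketch of the idea: optical flow feeds a Slot Attention encoder to produce slots, while the RGB image conditions a decoder that regresses per-slot flow and masks. All module names, layer sizes, the broadcast-style per-slot decoder, and the mask-weighted composition are illustrative assumptions rather than the authors' implementation; the adversarial Contextual Information Separation objective is only indicated in a comment.

```python
# Illustrative sketch (not the authors' code) of the DivA idea: flow -> slots,
# (slot, image) -> per-slot flow and mask. Sizes and layer choices are assumptions.
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    """Minimal Slot Attention (Locatello et al., 2020) over flow features."""
    def __init__(self, num_slots=4, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, dim) flow features
        B = x.shape[0]
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, x.shape[-1], device=x.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # compete over slots
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over pixels
            updates = attn @ v                                           # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).view_as(slots)
        return slots


class DivASketch(nn.Module):
    """Flow encodes into slots; the image only conditions the flow decoder."""
    def __init__(self, num_slots=4, dim=64):
        super().__init__()
        self.flow_enc = nn.Sequential(nn.Conv2d(2, dim, 5, padding=2), nn.ReLU(),
                                      nn.Conv2d(dim, dim, 5, padding=2))
        self.img_enc = nn.Sequential(nn.Conv2d(3, dim, 5, padding=2), nn.ReLU(),
                                     nn.Conv2d(dim, dim, 5, padding=2))
        self.slot_attn = SlotAttention(num_slots, dim)
        # decoder sees image context + one slot, predicts 2-ch flow + 1-ch mask logit
        self.decoder = nn.Sequential(nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(dim, 3, 3, padding=1))

    def forward(self, flow, image):             # flow: (B,2,H,W), image: (B,3,H,W)
        B, _, H, W = flow.shape
        f = self.flow_enc(flow).flatten(2).transpose(1, 2)     # (B, H*W, dim)
        slots = self.slot_attn(f)                               # (B, S, dim)
        ctx = self.img_enc(image)                               # (B, dim, H, W)
        flows, logits = [], []
        for s in range(slots.shape[1]):
            slot_map = slots[:, s, :, None, None].expand(-1, -1, H, W)
            out = self.decoder(torch.cat([ctx, slot_map], dim=1))
            flows.append(out[:, :2]); logits.append(out[:, 2:])
        masks = torch.softmax(torch.cat(logits, dim=1), dim=1)  # (B, S, H, W)
        recon = (torch.stack(flows, 1) * masks.unsqueeze(2)).sum(1)
        # Training in the paper is NOT plain minimization of ||recon - flow||:
        # an adversarial min-max criterion based on Contextual Information
        # Separation discourages any single slot from explaining the entire flow.
        return recon, masks
```

In this sketch each slot decodes its own flow hypothesis conditioned on the image, and the softmaxed masks decide which slot explains which pixel; the adversarial criterion described in the abstract is what fosters the split of the flow field across slots.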
Related papers
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD).
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
arXiv Detail & Related papers (2024-06-25T16:24:34Z)
- DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut [62.63481844384229]
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks.
In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method.
Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.
arXiv Detail & Related papers (2024-06-05T01:32:31Z)
- Motion-inductive Self-supervised Object Discovery in Videos [99.35664705038728]
We propose a model that processes consecutive RGB frames and infers the optical flow between any pair of frames using a layered representation.
We demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets.
arXiv Detail & Related papers (2022-10-01T08:38:28Z)
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address occlusion by employing body clues provided by an extra network to distinguish the visible parts.
We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge.
Under these priors, the occluded representation can be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network that can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation (a generic sketch of such a penalty appears after this list).
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
- Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification [35.40162083252931]
DRL-Net is a representation learning network that handles occluded re-ID without requiring strict person image alignment or any additional supervision.
It measures image similarity by automatically disentangling the representation of undefined semantic components.
The DRL-Net achieves superior re-ID performance consistently and outperforms the state-of-the-art by large margins for Occluded-DukeMTMC.
arXiv Detail & Related papers (2021-07-06T04:24:10Z)
- FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation [30.736361776703568]
Scene understanding based on LiDAR point cloud is an essential task for autonomous cars to drive safely.
Most existing methods simply stack different point attributes/modalities as image channels to increase information capacity.
We design FPS-Net, a convolutional fusion network that exploits the uniqueness and discrepancy among the projected image channels for optimal point cloud segmentation.
arXiv Detail & Related papers (2021-03-01T04:08:28Z)
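As a side note on the related work above, the redundancy-penalty idea in "Reducing Redundancy in the Bottleneck Representation of the Autoencoders" can be pictured with a short sketch. The off-diagonal correlation penalty below is a generic illustration under assumed names, not necessarily the exact scheme used in that paper.

```python
# Hypothetical sketch of a bottleneck-redundancy penalty for an autoencoder.
# The specific off-diagonal correlation term is an illustrative assumption.
import torch

def redundancy_penalty(z: torch.Tensor) -> torch.Tensor:
    """Penalize correlation between bottleneck features.

    z: (batch, features) bottleneck activations of an autoencoder.
    Returns the mean squared off-diagonal entry of the feature correlation
    matrix, which is zero when features are fully decorrelated.
    """
    z = z - z.mean(dim=0, keepdim=True)
    z = z / (z.std(dim=0, keepdim=True) + 1e-8)
    corr = (z.t() @ z) / z.shape[0]                  # (features, features)
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).mean()

# Typical use: loss = reconstruction_loss + lambda_red * redundancy_penalty(z)
```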
This list is automatically generated from the titles and abstracts of the papers in this site.