Explicitly Disentangled Representations in Object-Centric Learning
- URL: http://arxiv.org/abs/2401.10148v1
- Date: Thu, 18 Jan 2024 17:22:11 GMT
- Title: Explicitly Disentangled Representations in Object-Centric Learning
- Authors: Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland
- Abstract summary: We propose a novel architecture that biases object-centric models toward disentangling shape and texture components.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting structured representations from raw visual data is an important
and long-standing challenge in machine learning. Recently, techniques for
unsupervised learning of object-centric representations have raised growing
interest. In this context, enhancing the robustness of the latent features can
improve the efficiency and effectiveness of the training of downstream tasks. A
promising step in this direction is to disentangle the factors that cause
variation in the data. Previously, Invariant Slot Attention disentangled
position, scale, and orientation from the remaining features. Extending this
approach, we focus on separating the shape and texture components. In
particular, we propose a novel architecture that biases object-centric models
toward disentangling shape and texture components into two non-overlapping
subsets of the latent space dimensions. These subsets are known a priori, i.e.,
before the training process. Experiments on a range of object-centric
benchmarks reveal that our approach achieves the desired disentanglement while
also numerically improving baseline performance in most cases. In addition, we
show that our method can generate novel textures for a specific object or
transfer textures between objects with distinct shapes.
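To make the idea concrete, here is a minimal sketch (in PyTorch) of a fixed, a-priori partition of a slot latent into shape and texture dimensions, together with texture transfer by swapping the texture block between slots. The dimension sizes and helper names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: a-priori partition of each slot latent into shape and texture
# blocks, plus texture transfer by swapping the texture block. The split
# (32/32 over a 64-d latent) is an assumption for illustration.
import torch

SHAPE_DIMS = slice(0, 32)     # assumed: first 32 dims encode shape
TEXTURE_DIMS = slice(32, 64)  # assumed: last 32 dims encode texture

def split_slot(slot: torch.Tensor):
    """Split one slot latent (D,) into its shape and texture components."""
    return slot[SHAPE_DIMS], slot[TEXTURE_DIMS]

def transfer_texture(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Return a copy of `target` carrying `source`'s texture component."""
    out = target.clone()
    out[TEXTURE_DIMS] = source[TEXTURE_DIMS]
    return out

# Example: inspect the partition, then swap textures between two slots.
slot_a, slot_b = torch.randn(64), torch.randn(64)
shape_a, texture_a = split_slot(slot_a)
slot_a_with_b_texture = transfer_texture(source=slot_b, target=slot_a)
```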
Related papers
- Leveraging Foundation Models To learn the shape of semi-fluid deformable objects [0.7895162173260983]
Over the last decade, researchers have shown keen interest in characterizing and manipulating deformable objects of a non-fluid nature.
In this paper, we address characterizing the weld pool to define stable features that serve as information for motion control objectives.
Knowledge distillation from foundation models into a smaller generative model shows prominent results in the characterization of deformable objects.
arXiv Detail & Related papers (2024-11-25T13:41:35Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers [11.155818952879146]
Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics.
Can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?
We try to learn such disentangled representations for the case of static images [nsb], without making any specific assumptions about the kind of attributes that an object might have.
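As a hedged illustration of the setting this entry describes, the sketch below uses a transformer to predict next-step object slots from current slots; the class name and all hyperparameters are assumptions, not the paper's architecture.

```python
# Sketch: transformer dynamics over object-centric latents. Predicts the
# next-step slots from the current slots; sizes are illustrative only.
import torch
import torch.nn as nn

class SlotDynamics(nn.Module):
    def __init__(self, slot_dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=slot_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(slot_dim, slot_dim)

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        """slots: (batch, num_slots, slot_dim) -> predicted next-step slots."""
        return self.head(self.encoder(slots))
```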
arXiv Detail & Related papers (2024-07-03T15:43:54Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
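A minimal sketch of the motion cue this entry centers on, assuming optical flow is available: pixels that move faster than a threshold are grouped into connected regions. The paper's actual method is more involved; the function name and threshold are assumptions.

```python
# Sketch: group pixels into "object-like" regions by independent motion,
# using optical-flow magnitude as the cue.
import numpy as np
from scipy import ndimage

def discover_moving_objects(flow: np.ndarray, min_speed: float = 1.0):
    """flow: (H, W, 2) optical flow. Returns labeled mask of moving regions."""
    speed = np.linalg.norm(flow, axis=-1)        # per-pixel motion magnitude
    moving = speed > min_speed                   # pixels that move independently
    labels, num_objects = ndimage.label(moving)  # connected components
    return labels, num_objects
```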
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Generalization and Robustness Implications in Object-Centric Learning [23.021791024676986]
In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets.
From our experimental study, we find object-centric representations to be generally useful for downstream tasks.
arXiv Detail & Related papers (2021-07-01T17:51:11Z)
- Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances.
We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction.
Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)
- Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation [53.49821324597837]
Weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years.
We present a Context Decoupling Augmentation (CDA) method to change the inherent context in which the objects appear.
To validate the effectiveness of the proposed method, extensive experiments on the PASCAL VOC 2012 dataset with several alternative network architectures demonstrate that CDA can boost various popular WSSS methods to a new state of the art by a large margin.
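A minimal sketch of a context-decoupling style augmentation, assuming a simple cut-and-paste scheme: an object is lifted from its image with its mask and composited onto an unrelated background, so a model cannot lean on the object's original context. The function name is illustrative.

```python
# Sketch: decouple an object from its inherent context by pasting it onto
# an unrelated background (a simplified, assumed reading of the approach).
import numpy as np

def decouple_context(obj_img: np.ndarray, obj_mask: np.ndarray,
                     background: np.ndarray) -> np.ndarray:
    """Paste the masked object onto a new background of the same size.

    obj_img, background: (H, W, 3) uint8 images; obj_mask: (H, W) bool.
    """
    out = background.copy()
    out[obj_mask] = obj_img[obj_mask]
    return out
```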
arXiv Detail & Related papers (2021-03-02T15:05:09Z)
- Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings [22.889059874754242]
Generation of stroke-based non-photorealistic imagery is an important problem in the computer vision community.
Previous methods have been limited to datasets with little variation in position, scale and saliency of the foreground object.
We propose a Semantic Guidance pipeline that includes a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time.
arXiv Detail & Related papers (2020-11-25T09:00:04Z) - Saliency-driven Class Impressions for Feature Visualization of Deep
Neural Networks [55.11806035788036]
It is advantageous to visualize the features considered to be essential for classification.
Existing visualization methods develop high confidence images consisting of both background and foreground features.
In this work, we propose a saliency-driven approach to visualize discriminative features that are considered most important for a given task.
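One plausible reading of this recipe, sketched below with all details assumed: optimize an input image to maximize a class logit while masking gradients with a saliency map, so the synthesized features stay on the discriminative region. This is an illustrative sketch, not the paper's exact procedure.

```python
# Sketch: saliency-masked activation maximization. Assumes `model` maps a
# (1, 3, 224, 224) image to class logits, and `saliency` is broadcastable
# to the image shape with high values on discriminative pixels.
import torch

def visualize(model, class_idx: int, saliency: torch.Tensor,
              steps: int = 200, lr: float = 0.05) -> torch.Tensor:
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model(img)[0, class_idx]  # maximize the target class logit
        loss.backward()
        img.grad *= saliency              # restrict updates to salient pixels
        opt.step()
    return img.detach()
```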
arXiv Detail & Related papers (2020-07-31T06:11:06Z)