Control-oriented Clustering of Visual Latent Representation
- URL: http://arxiv.org/abs/2410.05063v4
- Date: Wed, 05 Feb 2025 22:12:00 GMT
- Title: Control-oriented Clustering of Visual Latent Representation
- Authors: Han Qi, Haocheng Yin, Heng Yang
- Abstract summary: We study the geometry of the visual representation space in an image-based control pipeline learned from behavior cloning.
Inspired by the phenomenon of neural collapse, we show a similar law of clustering in the visual representation space.
We show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance.
- Score: 3.9838014203847862
- Abstract: We initiate a study of the geometry of the visual representation space -- the information channel from the vision encoder to the action decoder -- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (arXiv:2008.08186), we empirically demonstrate the prevalent emergence of a similar law of clustering in the visual representation space. Specifically, in discrete image-based control (e.g., Lunar Lander), the visual representations cluster according to the natural discrete action labels; in continuous image-based control (e.g., Planar Pushing and Block Stacking), the clustering emerges according to "control-oriented" classes that are based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output. Each of the classes corresponds to one relative pose orthant (REPO). Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.
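The abstract describes two ingredients: assigning each sample to a relative pose orthant (REPO) class, and measuring how strongly visual features cluster by class. The following is a minimal Python sketch of both ideas, not the authors' implementation. It assumes a planar relative pose (dx, dy, dtheta) whose sign pattern gives 8 classes, and it uses a simple within-class to between-class variance ratio as a stand-in clustering measure. The function names `repo_labels` and `clustering_ratio` are hypothetical.

```python
import numpy as np


def repo_labels(rel_pose):
    """Map each relative pose (dx, dy, dtheta) to one of 8 orthant classes.

    The sign pattern of the components is read as a binary number.
    Treating zero as non-negative is an arbitrary choice made for this sketch.
    """
    rel_pose = np.asarray(rel_pose, dtype=float)
    bits = (rel_pose >= 0).astype(int)            # shape (N, 3), entries 0 or 1
    weights = 2 ** np.arange(rel_pose.shape[1])   # [1, 2, 4] for a 3-D pose
    return bits @ weights


def clustering_ratio(features, labels):
    """Within-class scatter divided by between-class scatter.

    Smaller values mean features sit closer to their class means relative
    to how far the class means are spread apart. This is an illustrative
    measure and not necessarily the metric used in the paper. It assumes
    at least two classes with distinct means (otherwise the denominator is 0).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += ((class_feats - class_mean) ** 2).sum()
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between
```

For example, `repo_labels([[1.0, -0.5, 0.2]])` yields class 5 under this encoding (bits 1, 0, 1). A ratio like this could in principle be tracked during training or used as a regularization term, but how the paper turns neural collapse into a pretraining loss is not specified in the abstract.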
Related papers
- Self-Supervised Representation Learning for Adversarial Attack Detection [6.528181610035978]
Supervised learning-based adversarial attack detection methods rely on a large number of labeled data.
We propose a self-supervised representation learning framework for the adversarial attack detection task to address this drawback.
arXiv Detail & Related papers (2024-07-05T09:37:16Z)
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Perceptual Group Tokenizer: Building Perception with Iterative Grouping [14.760204235027627]
We propose the Perceptual Group Tokenizer, a model that relies on grouping operations to extract visual features and perform self-supervised representation learning.
We show that the proposed model can achieve competitive computation performance compared to state-of-the-art vision architectures.
arXiv Detail & Related papers (2023-11-30T07:00:14Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- Self-supervised Contrastive Learning for Cross-domain Hyperspectral Image Representation [26.610588734000316]
This paper introduces a self-supervised learning framework suitable for hyperspectral images that are inherently challenging to annotate.
The proposed framework architecture leverages cross-domain CNN, allowing for learning representations from different hyperspectral images.
The experimental results demonstrate the advantage of the proposed self-supervised representation over models trained from scratch or other transfer learning methods.
arXiv Detail & Related papers (2022-02-08T16:16:45Z)
- Looking Beyond Corners: Contrastive Learning of Visual Representations for Keypoint Detection and Description Extraction [1.5749416770494706]
Learnable keypoint detectors and descriptors are beginning to outperform classical hand-crafted feature extraction methods.
Recent studies on self-supervised learning of visual representations have driven the increasing performance of learnable models based on deep networks.
We propose the Correspondence Network (CorrNet) that learns to detect repeatable keypoints and to extract discriminative descriptions.
arXiv Detail & Related papers (2021-12-22T16:27:11Z)
- SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [114.58986229852489]
In this paper, we explore the basic and generic supervision in the sequence from spatial, sequential and temporal perspectives.
We derive a particular form named Sequence Contrastive Learning (SeCo).
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)