Perceptual Group Tokenizer: Building Perception with Iterative Grouping
- URL: http://arxiv.org/abs/2311.18296v2
- Date: Thu, 25 Jan 2024 01:18:28 GMT
- Title: Perceptual Group Tokenizer: Building Perception with Iterative Grouping
- Authors: Zhiwei Deng, Ting Chen, Yang Li
- Abstract summary: We propose the Perceptual Group Tokenizer, a model that relies on grouping operations to extract visual features and perform self-supervised representation learning.
We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures.
- Score: 14.760204235027627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The human visual recognition system shows an astonishing ability to compress visual information into a set of tokens containing rich representations, without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates representations of comparable power. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning: a series of grouping operations is used to iteratively hypothesize the context for each pixel or superpixel and refine the feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear-probe evaluation, marking new progress under this paradigm.
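To make the grouping idea above concrete, here is a minimal sketch of one iterative-grouping round in the spirit of the abstract, written as a slot-attention-style update: group tokens repeatedly hypothesize soft assignments over pixel/superpixel features and are refined as weighted means of their members. All names, shapes, and the specific attention update are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grouping_round(tokens, features, num_iters=3):
    """One round of iterative grouping: `tokens` are group hypotheses,
    `features` are pixel/superpixel embeddings to be grouped.

    tokens:   (num_groups, dim)
    features: (num_inputs, dim)
    """
    dim = tokens.shape[-1]
    assign = None
    for _ in range(num_iters):
        # Each input softly hypothesizes which group token is its context.
        logits = features @ tokens.T / dim ** 0.5   # (num_inputs, num_groups)
        assign = F.softmax(logits, dim=-1)
        # Normalize per group so sparsely-assigned groups still get a clean update.
        weights = assign / (assign.sum(dim=0, keepdim=True) + 1e-8)
        # Refine each group token as the weighted mean of its members.
        tokens = weights.T @ features               # (num_groups, dim)
    return tokens, assign

# Toy usage: compress 64 "superpixel" features into 8 group tokens.
features = torch.randn(64, 32)
tokens = torch.randn(8, 32)
tokens, assign = grouping_round(tokens, features)
print(tokens.shape, assign.shape)  # torch.Size([8, 32]) torch.Size([64, 8])
```

In the actual model, rounds like this would be stacked and trained with a self-supervised objective; the adaptive-computation property noted above plausibly corresponds to varying the number of group tokens or grouping iterations at inference time.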
Related papers
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
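As a rough illustration of the alternation this summary describes, the following toy loop alternates between abstracting representatives from pixel features and updating the features with them, in a k-means-like fashion. This is a sketch under loose assumptions, not the FEC architecture, which performs this process on deep features inside a network.

```python
import numpy as np

def fec_like_alternation(pixels, num_reps=4, steps=5, blend=0.5):
    """Toy alternation: select representatives, then pull pixel
    features toward their current representative."""
    rng = np.random.default_rng(0)
    reps = pixels[rng.choice(len(pixels), num_reps, replace=False)]
    for _ in range(steps):
        # Grouping step: assign each pixel feature to its nearest representative.
        dists = ((pixels[:, None, :] - reps[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Abstraction step: each representative becomes the mean of its cluster.
        for k in range(num_reps):
            if (labels == k).any():
                reps[k] = pixels[labels == k].mean(axis=0)
        # Update step: blend pixel features toward their representative.
        pixels = (1 - blend) * pixels + blend * reps[labels]
    return pixels, reps, labels

feats = np.random.default_rng(1).normal(size=(100, 16))
feats, reps, labels = fec_like_alternation(feats)
print(reps.shape, labels.shape)  # (4, 16) (100,)
```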
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Homomorphism Autoencoder -- Learning Group Structured Representations from Observed Transitions [51.71245032890532]
We propose methods enabling an agent acting upon the world to learn internal representations of sensory information consistent with actions that modify it.
In contrast to existing work, our approach does not require prior knowledge of the group and does not restrict the set of actions the agent can perform.
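A toy rendering of that idea: learn an encoder together with a per-action matrix acting on the latent space, so that taking an action in the world corresponds to a linear transform of the representation. The names, the linear encoder, and the single prediction loss are simplified assumptions for illustration; the actual model is an autoencoder with additional terms.

```python
import torch
import torch.nn as nn

latent_dim, obs_dim, num_actions = 8, 32, 4
phi = nn.Linear(obs_dim, latent_dim)  # encoder (a toy linear stand-in)
# One learnable matrix per action, initialized to the identity.
rho = nn.Parameter(torch.eye(latent_dim).repeat(num_actions, 1, 1))
opt = torch.optim.Adam(list(phi.parameters()) + [rho], lr=1e-2)

def train_step(obs, action, next_obs):
    z, z_next = phi(obs), phi(next_obs)
    # Acting in the world should be a linear map on the representation:
    # phi(next_obs) ~ rho[action] @ phi(obs)
    pred = torch.einsum('bij,bj->bi', rho[action], z)
    loss = ((pred - z_next) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Fake transitions just to show the call signature.
obs = torch.randn(16, obs_dim)
action = torch.randint(0, num_actions, (16,))
next_obs = torch.randn(16, obs_dim)
print(train_step(obs, action, next_obs))
```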
arXiv Detail & Related papers (2022-07-25T11:22:48Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
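To illustrate the general recipe this summary points at, pooling dense features into semantic slots and then contrasting matching slots across two augmented views, here is a simplified sketch. The prototype-based assignment and the InfoNCE-style pairing are assumptions for illustration, not the SlotCon implementation.

```python
import torch
import torch.nn.functional as F

def slots_from_features(features, prototypes):
    # Soft-assign dense features (N, D) to K prototypes, pool into K slots.
    assign = F.softmax(features @ prototypes.T, dim=-1)  # (N, K)
    slots = assign.T @ features                          # (K, D)
    return F.normalize(slots, dim=-1)

def slot_contrastive_loss(slots_a, slots_b, temp=0.1):
    # Slots with matching indices across the two views are positives.
    logits = slots_a @ slots_b.T / temp                  # (K, K)
    targets = torch.arange(len(slots_a))
    return F.cross_entropy(logits, targets)

protos = F.normalize(torch.randn(8, 64), dim=-1)
view_a, view_b = torch.randn(196, 64), torch.randn(196, 64)
loss = slot_contrastive_loss(slots_from_features(view_a, protos),
                             slots_from_features(view_b, protos))
print(loss.item())
```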
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- An Empirical Investigation of Representation Learning for Imitation [76.48784376425911]
Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data.
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation.
arXiv Detail & Related papers (2022-05-16T11:23:42Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Information Maximization Clustering via Multi-View Self-Labelling [9.947717243638289]
We propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations.
This is achieved by integrating a discrete representation into the self-supervised paradigm through a network.
Our empirical results show that the proposed framework outperforms state-of-the-art techniques, achieving average accuracies of 89.1% and 49.0% on the two benchmark datasets considered.
arXiv Detail & Related papers (2021-03-12T16:04:41Z)
- Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [91.58529629419135]
We consider how to characterise visual groupings discovered automatically by deep neural networks.
We introduce two concepts, visual learnability and describability, that can be used to quantify the interpretability of arbitrary image groupings.
arXiv Detail & Related papers (2020-10-27T18:41:49Z)