Perceptual Group Tokenizer: Building Perception with Iterative Grouping
- URL: http://arxiv.org/abs/2311.18296v2
- Date: Thu, 25 Jan 2024 01:18:28 GMT
- Title: Perceptual Group Tokenizer: Building Perception with Iterative Grouping
- Authors: Zhiwei Deng, Ting Chen, Yang Li
- Abstract summary: We propose the Perceptual Group Tokenizer, a model that relies on grouping operations to extract visual features and perform self-supervised representation learning.
We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures.
- Score: 14.760204235027627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The human visual recognition system shows an astonishing ability to compress visual information into a set of tokens containing rich representations, without label supervision. One critical driving principle behind it is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates representations of comparable power. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning: a series of grouping operations is used to iteratively hypothesize the context for each pixel or superpixel and refine the feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear-probe evaluation, marking new progress under this paradigm.
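To make the grouping idea above concrete, here is a minimal sketch of one iterative-grouping round in the spirit of the abstract, written as a slot-attention-style update: group tokens repeatedly hypothesize soft assignments over pixel/superpixel features and are refined as weighted means of their members. All names, shapes, and the specific attention update are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grouping_round(tokens, features, num_iters=3):
    """One round of iterative grouping: `tokens` are group hypotheses,
    `features` are pixel/superpixel embeddings to be grouped.

    tokens:   (num_groups, dim)
    features: (num_inputs, dim)
    """
    dim = tokens.shape[-1]
    assign = None
    for _ in range(num_iters):
        # Each input softly hypothesizes which group token is its context.
        logits = features @ tokens.T / dim ** 0.5   # (num_inputs, num_groups)
        assign = F.softmax(logits, dim=-1)
        # Normalize per group so sparsely-assigned groups still get a clean update.
        weights = assign / (assign.sum(dim=0, keepdim=True) + 1e-8)
        # Refine each group token as the weighted mean of its members.
        tokens = weights.T @ features               # (num_groups, dim)
    return tokens, assign

# Toy usage: compress 64 "superpixel" features into 8 group tokens.
features = torch.randn(64, 32)
tokens = torch.randn(8, 32)
tokens, assign = grouping_round(tokens, features)
print(tokens.shape, assign.shape)  # torch.Size([8, 32]) torch.Size([64, 8])
```

In the actual model, rounds like this would be stacked and trained with a self-supervised objective; the adaptive-computation property noted above plausibly corresponds to varying the number of group tokens or grouping iterations at inference time.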
Related papers
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
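As a rough illustration of the alternation this summary describes, the following toy loop alternates between abstracting representatives from pixel features and updating the features with them, in a k-means-like fashion. This is a sketch under loose assumptions, not the FEC architecture, which performs this process on deep features inside a network.

```python
import numpy as np

def fec_like_alternation(pixels, num_reps=4, steps=5, blend=0.5):
    """Toy alternation: select representatives, then pull pixel
    features toward their current representative."""
    rng = np.random.default_rng(0)
    reps = pixels[rng.choice(len(pixels), num_reps, replace=False)]
    for _ in range(steps):
        # Grouping step: assign each pixel feature to its nearest representative.
        dists = ((pixels[:, None, :] - reps[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Abstraction step: each representative becomes the mean of its cluster.
        for k in range(num_reps):
            if (labels == k).any():
                reps[k] = pixels[labels == k].mean(axis=0)
        # Update step: blend pixel features toward their representative.
        pixels = (1 - blend) * pixels + blend * reps[labels]
    return pixels, reps, labels

feats = np.random.default_rng(1).normal(size=(100, 16))
feats, reps, labels = fec_like_alternation(feats)
print(reps.shape, labels.shape)  # (4, 16) (100,)
```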
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Homomorphism Autoencoder -- Learning Group Structured Representations from Observed Transitions [51.71245032890532]
We propose methods enabling an agent acting upon the world to learn internal representations of sensory information consistent with actions that modify it.
In contrast to existing work, our approach does not require prior knowledge of the group and does not restrict the set of actions the agent can perform.
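A toy rendering of that idea: learn an encoder together with a per-action matrix acting on the latent space, so that taking an action in the world corresponds to a linear transform of the representation. The names, the linear encoder, and the single prediction loss are simplified assumptions for illustration; the actual model is an autoencoder with additional terms.

```python
import torch
import torch.nn as nn

latent_dim, obs_dim, num_actions = 8, 32, 4
phi = nn.Linear(obs_dim, latent_dim)  # encoder (a toy linear stand-in)
# One learnable matrix per action, initialized to the identity.
rho = nn.Parameter(torch.eye(latent_dim).repeat(num_actions, 1, 1))
opt = torch.optim.Adam(list(phi.parameters()) + [rho], lr=1e-2)

def train_step(obs, action, next_obs):
    z, z_next = phi(obs), phi(next_obs)
    # Acting in the world should be a linear map on the representation:
    # phi(next_obs) ~ rho[action] @ phi(obs)
    pred = torch.einsum('bij,bj->bi', rho[action], z)
    loss = ((pred - z_next) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Fake transitions just to show the call signature.
obs = torch.randn(16, obs_dim)
action = torch.randint(0, num_actions, (16,))
next_obs = torch.randn(16, obs_dim)
print(train_step(obs, action, next_obs))
```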
arXiv Detail & Related papers (2022-07-25T11:22:48Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
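To illustrate the general recipe this summary points at, pooling dense features into semantic slots and then contrasting matching slots across two augmented views, here is a simplified sketch. The prototype-based assignment and the InfoNCE-style pairing are assumptions for illustration, not the SlotCon implementation.

```python
import torch
import torch.nn.functional as F

def slots_from_features(features, prototypes):
    # Soft-assign dense features (N, D) to K prototypes, pool into K slots.
    assign = F.softmax(features @ prototypes.T, dim=-1)  # (N, K)
    slots = assign.T @ features                          # (K, D)
    return F.normalize(slots, dim=-1)

def slot_contrastive_loss(slots_a, slots_b, temp=0.1):
    # Slots with matching indices across the two views are positives.
    logits = slots_a @ slots_b.T / temp                  # (K, K)
    targets = torch.arange(len(slots_a))
    return F.cross_entropy(logits, targets)

protos = F.normalize(torch.randn(8, 64), dim=-1)
view_a, view_b = torch.randn(196, 64), torch.randn(196, 64)
loss = slot_contrastive_loss(slots_from_features(view_a, protos),
                             slots_from_features(view_b, protos))
print(loss.item())
```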
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- An Empirical Investigation of Representation Learning for Imitation [76.48784376425911]
Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data.
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation.
arXiv Detail & Related papers (2022-05-16T11:23:42Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Information Maximization Clustering via Multi-View Self-Labelling [9.947717243638289]
We propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations.
This is achieved by integrating a discrete representation into the self-supervised paradigm through a network.
Our empirical results show that the proposed framework outperforms state-of-the-art techniques, achieving average accuracies of 89.1% and 49.0% on the two benchmark datasets considered.
arXiv Detail & Related papers (2021-03-12T16:04:41Z)
- Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [91.58529629419135]
We consider how to characterise visual groupings discovered automatically by deep neural networks.
We introduce two concepts, visual learnability and describability, that can be used to quantify the interpretability of arbitrary image groupings.
arXiv Detail & Related papers (2020-10-27T18:41:49Z)