Related papers: Neural Clustering based Visual Representation Learning

Neural Clustering based Visual Representation Learning

URL: http://arxiv.org/abs/2403.17409v1
Date: Tue, 26 Mar 2024 06:04:50 GMT
Title: Neural Clustering based Visual Representation Learning
Authors: Guikun Chen, Xia Li, Yi Yang, Wenguan Wang,
Abstract summary: Clustering is one of the most classic approaches in machine learning and data analysis. We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data. FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
Score: 61.72646814537163
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate a fundamental aspect of machine vision: the measurement of features, by revisiting clustering, one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors, including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. Though prevalent, such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image, FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers, which can be viewed and inspected by humans, make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness, generality, and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm.

Related papers

Perceptual Group Tokenizer: Building Perception with Iterative Grouping [14.760204235027627]
We propose the Perceptual Group Tokenizer, a model that relies on grouping operations to extract visual features and perform self-supervised representation learning. We show that the proposed model can achieve competitive computation performance compared to state-of-the-art vision architectures.
arXiv Detail & Related papers (2023-11-30T07:00:14Z)
On the Transition from Neural Representation to Symbolic Knowledge [2.2528422603742304]
We propose a Neural-Symbolic Transitional Dictionary Learning (TDL) framework that employs an EM algorithm to learn a transitional representation of data. We implement the framework with a diffusion model by regarding the decomposition of input as a cooperative game. We additionally use RL enabled by the Markovian of diffusion models to further tune the learned prototypes.
arXiv Detail & Related papers (2023-08-03T19:29:35Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC) UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
Robust Representation Learning by Clustering with Bisimulation Metrics for Visual Reinforcement Learning with Distractions [9.088460902782547]
Clustering with Bisimulation Metrics (CBM) learns robust representations by grouping visual observations in the latent space. CBM alternates between two steps: (1) grouping observations by measuring their bisimulation distances to the learned prototypes; (2) learning a set of prototypes according to the current cluster assignments. Experiments demonstrate that CBM significantly improves the sample efficiency of popular visual RL algorithms.
arXiv Detail & Related papers (2023-02-12T13:27:34Z)
Multi-Branch Deep Radial Basis Function Networks for Facial Emotion Recognition [80.35852245488043]
We propose a CNN based architecture enhanced with multiple branches formed by radial basis function (RBF) units. RBF units capture local patterns shared by similar instances using an intermediate representation. We show it is the incorporation of local information what makes the proposed model competitive.
arXiv Detail & Related papers (2021-09-07T21:05:56Z)
Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
Information Maximization Clustering via Multi-View Self-Labelling [9.947717243638289]
We propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations. This is achieved by integrating a discrete representation into the self-supervised paradigm through a net. Our empirical results show that the proposed framework outperforms state-of-the-art techniques with the average accuracy of 89.1% and 49.0%, respectively.
arXiv Detail & Related papers (2021-03-12T16:04:41Z)
Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction [158.88345945211185]
We present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale features learning and fusion. We propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner.
arXiv Detail & Related papers (2021-01-08T04:14:29Z)
Two-Level Adversarial Visual-Semantic Coupling for Generalized Zero-shot Learning [21.89909688056478]
We propose a new two-level joint idea to augment the generative network with an inference network during training. This provides strong cross-modal interaction for effective transfer of knowledge between visual and semantic domains. We evaluate our approach on four benchmark datasets against several state-of-the-art methods, and show its performance.
arXiv Detail & Related papers (2020-07-15T15:34:09Z)
Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. We combine the resulting representations with global scene information for accurately predicting visual saliency. Compared to state of the art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.