Simple Self Organizing Map with Visual Transformer
- URL: http://arxiv.org/abs/2503.04121v1
- Date: Thu, 06 Mar 2025 05:58:41 GMT
- Title: Simple Self Organizing Map with Visual Transformer
- Authors: Alan Luo, Kaiwen Yuan
- Abstract summary: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. ViTs tend to underperform on smaller datasets due to their inherent lack of inductive biases. Self-Organizing Maps (SOMs) are inherently structured to preserve topology and spatial organization.
- Score: 1.3121410433987561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate for directly addressing the limitations of ViTs on limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration of how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code will be publicly available.
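The code has not been released yet, but the central component, a Self-Organizing Map trained on high-dimensional feature vectors, is standard and easy to sketch. The snippet below is a generic SOM trainer applied to placeholder "ViT patch embedding" vectors; the grid size, decay schedules, and the use of random features are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def train_som(features, grid_h=8, grid_w=8, epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    """Classic SOM: each grid node holds a weight vector; the best-matching
    unit (BMU) and its neighbours are pulled toward each input feature."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    weights = rng.normal(size=(grid_h * grid_w, d))            # node codebook
    # 2-D coordinates of every node, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                  indexing="ij"), axis=-1).reshape(-1, 2).astype(float)
    total_steps, step = epochs * n, 0
    for _ in range(epochs):
        for x in features[rng.permutation(n)]:
            t = step / total_steps
            lr, sigma = lr0 * (1 - t), sigma0 * (1 - t) + 1e-3   # decaying schedules
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best-matching unit
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # grid distance to BMU
            h = np.exp(-dist2 / (2 * sigma ** 2))                # neighbourhood kernel
            weights += lr * h[:, None] * (x - weights)           # pull nodes toward x
            step += 1
    return weights.reshape(grid_h, grid_w, d)

# Illustrative usage: pretend these are 768-dim ViT patch embeddings.
patch_feats = np.random.randn(1024, 768).astype(np.float32)
som = train_som(patch_feats, epochs=2)
print(som.shape)  # (8, 8, 768): a topology-preserving grid of prototypes
```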
Related papers
- Structured Initialization for Attention in Vision Transformers [34.374054040300805]
Convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on small-scale problems.
We argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT.
This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications.
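To make the idea concrete, the sketch below builds convolution-like attention biases that could be added to a ViT's attention logits at initialization, so that each head starts out attending to a soft local neighbourhood. The per-head offsets, Gaussian bandwidth, and 14x14 patch grid are illustrative assumptions; the paper's actual initialization scheme may differ.

```python
import torch

def conv_like_attention_bias(grid=14, num_heads=9, bandwidth=1.0):
    """Per-head attention biases whose softmax is a soft local pattern:
    head k prefers the key patch at a fixed spatial offset from the query,
    loosely mimicking one tap of a 3x3 convolution."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()        # (N, 2) patch coords
    # one offset per head, laid out like the taps of a 3x3 kernel
    offsets = torch.tensor([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)],
                           dtype=torch.float32)[:num_heads]
    # squared distance between the head's preferred location and every key patch
    target = pos[None, :, None, :] + offsets[:, None, None, :]        # (H, N, 1, 2)
    d2 = ((target - pos[None, None, :, :]) ** 2).sum(-1)              # (H, N, N)
    return -d2 / (2 * bandwidth ** 2)   # add to attention logits before softmax

bias = conv_like_attention_bias()
print(bias.shape)  # torch.Size([9, 196, 196]): "soft 3x3" attention at initialization
```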
arXiv Detail & Related papers (2024-04-01T14:34:47Z)
- Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation [13.056764072568749]
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) on smaller datasets.
We propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs.
Our proposed module is memory- and computation-efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens.
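A minimal sketch of this kind of local-information module is shown below: patch tokens are folded back into their 2-D grid, passed through a depthwise convolution to gather local context, and added residually to the embeddings, while auxiliary tokens pass through untouched. The depthwise-convolution choice and token layout are assumptions, not the exact LIFE design.

```python
import torch
import torch.nn as nn

class LocalEnhancer(nn.Module):
    """LIFE-style sketch: inject patch-level local context into token embeddings;
    auxiliary tokens (CLS/distillation) are left unchanged."""
    def __init__(self, dim, grid):
        super().__init__()
        self.grid = grid
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, num_aux=1):
        aux, patches = tokens[:, :num_aux], tokens[:, num_aux:]   # split off CLS etc.
        b, n, d = patches.shape
        x = patches.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        local = self.dwconv(x).flatten(2).transpose(1, 2)         # local context per patch
        return torch.cat([aux, patches + local], dim=1)           # residual injection

tokens = torch.randn(2, 1 + 14 * 14, 192)   # CLS token + 14x14 patches, dim 192
out = LocalEnhancer(dim=192, grid=14)(tokens)
print(out.shape)  # torch.Size([2, 197, 192])
```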
arXiv Detail & Related papers (2023-05-15T11:23:18Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer [52.0699787446221]
We propose MonoViT, a framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy.
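As a rough illustration of mixing plain convolutions with Transformer blocks, the hybrid block below applies a convolution for local reasoning followed by multi-head self-attention for global reasoning. It is a generic sketch under assumed tensor shapes, not MonoViT's actual encoder.

```python
import torch
import torch.nn as nn

class ConvTransformerBlock(nn.Module):
    """Illustrative hybrid block: a convolution captures local structure,
    self-attention captures global context, both added residually."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        x = x + self.local(x)                  # local reasoning
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        t = self.norm(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]  # global reasoning
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(1, 64, 24, 80)              # an assumed feature-map size
print(ConvTransformerBlock(64)(feat).shape)    # torch.Size([1, 64, 24, 80])
```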
arXiv Detail & Related papers (2022-08-06T16:54:45Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
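The notion of feature self-relation can be sketched directly: compute token-to-token (spatial) and channel-to-channel similarities of a feature map and use them as self-supervised targets. The softmax normalization, scaling, and KL objective below are illustrative assumptions rather than SERE's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(feats):
    """Token-to-token similarity: how each patch relates to every other patch."""
    f = F.normalize(feats, dim=-1)                  # (B, N, C)
    return F.softmax(f @ f.transpose(1, 2) / f.shape[-1] ** 0.5, dim=-1)  # (B, N, N)

def channel_self_relation(feats):
    """Channel-to-channel similarity: how feature channels co-activate."""
    f = F.normalize(feats.transpose(1, 2), dim=-1)  # (B, C, N)
    return F.softmax(f @ f.transpose(1, 2) / f.shape[-1] ** 0.5, dim=-1)  # (B, C, C)

# Illustrative self-supervised objective: make the student's self-relations
# match a teacher's (e.g. a momentum encoder). The KL form is an assumption.
student = torch.randn(2, 196, 384)
teacher = torch.randn(2, 196, 384)
loss = F.kl_div(spatial_self_relation(student).log(),
                spatial_self_relation(teacher), reduction="batchmean") \
     + F.kl_div(channel_self_relation(student).log(),
                channel_self_relation(teacher), reduction="batchmean")
print(float(loss))
```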
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks that use ViTs, and is the first to achieve higher performance than CNN-based state-of-the-art methods.
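A minimal sketch of location-specific supervision in this spirit is given below: a teacher network produces a soft target for every patch token and the student's patch tokens are distilled against them. The temperature, KL loss, and pseudo-class count are assumptions, not SUN's exact recipe.

```python
import torch
import torch.nn.functional as F

def patch_level_loss(student_patch_logits, teacher_patch_logits, tau=2.0):
    """Location-specific supervision: a (frozen) teacher produces a soft target
    for every patch token; the student's patch tokens are trained to match them."""
    t = F.softmax(teacher_patch_logits / tau, dim=-1)       # (B, N, K) soft targets
    s = F.log_softmax(student_patch_logits / tau, dim=-1)   # (B, N, K) student
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

# Illustrative usage with 196 patch tokens and 64 assumed pseudo-classes.
student = torch.randn(4, 196, 64, requires_grad=True)
with torch.no_grad():
    teacher = torch.randn(4, 196, 64)
loss = patch_level_loss(student, teacher)
loss.backward()
print(float(loss))
```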
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
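Token slimming can be pictured as learning a soft assignment that aggregates many tokens into a few, so that subsequent blocks run on a shorter sequence. The single-linear-layer scorer below is a simplified stand-in for the paper's TSM, not its actual design.

```python
import torch
import torch.nn as nn

class TokenSlimming(nn.Module):
    """Sketch of token slimming: aggregate N input tokens into M < N "slimmed"
    tokens with a data-dependent soft assignment, so later blocks attend over
    fewer tokens."""
    def __init__(self, dim, num_out_tokens):
        super().__init__()
        self.score = nn.Linear(dim, num_out_tokens)   # per-token assignment logits

    def forward(self, tokens):                        # (B, N, C)
        attn = self.score(tokens).softmax(dim=1)      # (B, N, M) soft token-to-slot weights
        return attn.transpose(1, 2) @ tokens          # (B, M, C) slimmed tokens

x = torch.randn(2, 196, 384)
slim = TokenSlimming(dim=384, num_out_tokens=49)
print(slim(x).shape)  # torch.Size([2, 49, 384]): 4x fewer tokens downstream
```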
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [16.308432111311195]
Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification.
We introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias.
The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency.
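The gating idea can be sketched in a few lines: each head blends a content-based attention map with a purely positional (convolution-like) one through a learned scalar gate. The tensor layout and the way positional scores are produced below are simplifications, not ConViT's exact GPSA implementation.

```python
import torch
import torch.nn.functional as F

def gated_positional_attention(q, k, pos_scores, gate):
    """GPSA-style blend: at a gate near 1 the head behaves like a (soft)
    convolution driven purely by patch offsets; near 0 it is ordinary
    content-based self-attention."""
    content = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    positional = F.softmax(pos_scores, dim=-1)       # depends only on patch offsets
    g = torch.sigmoid(gate).view(-1, 1, 1)           # one learned gate per head
    return (1 - g) * content + g * positional        # (H, N, N) attention weights

heads, n, d = 4, 196, 64
q, k = torch.randn(heads, n, d), torch.randn(heads, n, d)
pos_scores = torch.randn(heads, n, n)                # e.g. conv-like locality logits
gate = torch.zeros(heads, requires_grad=True)        # sigmoid(0)=0.5: half content, half position
attn = gated_positional_attention(q, k, pos_scores, gate)
print(attn.shape, attn.sum(-1)[0, 0].item())         # rows still sum to 1
```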
arXiv Detail & Related papers (2021-03-19T09:11:20Z)
- LSM: Learning Subspace Minimization for Low-level Vision [78.27774638569218]
We replace the regularization term with a learnable subspace constraint, and preserve the data term to exploit domain knowledge.
This learning subspace minimization (LSM) framework unifies the network structures and the parameters for many low-level vision tasks.
We demonstrate our LSM framework on four low-level tasks including interactive image segmentation, video segmentation, stereo matching, and optical flow, and validate the network on various datasets.
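The subspace idea can be illustrated with a toy least-squares data term: rather than adding a hand-crafted regularizer, the solution is restricted to a low-dimensional basis (which LSM would predict with a network; here it is random) and the data term is minimized over the basis coefficients. The shapes and the least-squares data term are illustrative assumptions.

```python
import torch

def solve_in_subspace(A, y, basis):
    """Instead of solving min_x ||A x - y||^2 + R(x) with a hand-crafted
    regularizer R, constrain the solution to x = B c and minimize only the
    data term over the coefficients c."""
    AB = A @ basis                                        # (m, k) reduced system
    c = torch.linalg.lstsq(AB, y.unsqueeze(-1)).solution  # coefficients in the subspace
    return (basis @ c).squeeze(-1)                        # back to the full variable

m, n, k = 100, 50, 8
A, y = torch.randn(m, n), torch.randn(m)
basis = torch.linalg.qr(torch.randn(n, k)).Q              # stand-in for a learned basis
x = solve_in_subspace(A, y, basis)
print(x.shape)  # torch.Size([50]): the data term is minimized within span(B)
```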
arXiv Detail & Related papers (2020-04-20T10:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.