Vision Transformers with Natural Language Semantics
- URL: http://arxiv.org/abs/2402.17863v1
- Date: Tue, 27 Feb 2024 19:54:42 GMT
- Title: Vision Transformers with Natural Language Semantics
- Authors: Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro
- Abstract summary: Tokens within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP).
We introduce a novel transformer model, Semantic Vision Transformers (sViT), which harnesses semantic information.
sViT creates an inductive bias reminiscent of convolutional neural networks while capturing the global dependencies characteristic of transformers.
- Score: 13.535916922328287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokens or patches within Vision Transformers (ViT) lack essential semantic
information, unlike their counterparts in natural language processing (NLP).
Typically, ViT tokens are associated with rectangular image patches that lack
specific semantic context, making interpretation difficult and failing to
effectively encapsulate information. We introduce a novel transformer model,
Semantic Vision Transformers (sViT), which leverages recent progress on
segmentation models to design novel tokenizer strategies. sViT effectively
harnesses semantic information, creating an inductive bias reminiscent of
convolutional neural networks while capturing global dependencies and
contextual information within images that are characteristic of transformers.
Through validation using real datasets, sViT demonstrates superiority over ViT,
requiring less training data while maintaining similar or superior performance.
Furthermore, sViT demonstrates significant superiority in out-of-distribution
generalization and robustness to natural distribution shifts, attributed to its
scale-invariant semantic characteristics. Notably, the use of semantic tokens
significantly enhances the model's interpretability. Lastly, the proposed
paradigm facilitates the introduction of new and powerful augmentation
techniques at the token (or segment) level, increasing training data diversity
and generalization capabilities. Just as sentences are made of words, images
are formed by semantic objects; our proposed methodology leverages recent
progress in object segmentation and takes an important and natural step toward
interpretable and robust vision transformers.
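To make the tokenizer idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of how segment-based tokenization could look: it assumes an off-the-shelf segmentation model hidden behind a placeholder segment_image() function, and it simply crops, resizes, and linearly projects each segmented region into one token embedding.
```python
# Hypothetical sketch of segment-based tokenization, not the authors' implementation.
# Assumes an off-the-shelf segmentation model (e.g. a SAM-like model) exposed
# through the placeholder segment_image(), which returns one binary mask per segment.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_image(image: torch.Tensor) -> list[torch.Tensor]:
    """Placeholder: return a list of (H, W) binary masks, one per semantic segment."""
    raise NotImplementedError


class SemanticTokenizer(nn.Module):
    """Turns each segmented region into a single token embedding."""

    def __init__(self, patch_size: int = 32, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W)
        tokens = []
        for mask in segment_image(image):
            ys, xs = torch.nonzero(mask, as_tuple=True)
            y0, y1 = ys.min(), ys.max() + 1
            x0, x1 = xs.min(), xs.max() + 1
            # Zero out the background, crop the bounding box, resize to a fixed size.
            region = (image * mask)[:, y0:y1, x0:x1]
            region = F.interpolate(region[None], size=(self.patch_size, self.patch_size),
                                   mode="bilinear", align_corners=False)[0]
            tokens.append(self.proj(region.flatten()))
        # (num_segments, embed_dim): one token per semantic object, fed to a
        # standard transformer encoder together with positional information.
        return torch.stack(tokens)
```
Because every segment passes through the same crop-resize-project pipeline, this is also a natural place for the segment-level augmentation the abstract mentions: each region could be transformed independently (color jitter, flips, etc.) before projection.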
Related papers
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
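The underlying projection idea (the so-called logit lens) can be written in a couple of lines; the layout below is schematic and not the VISIT tool's actual API:
```python
# Schematic "logit lens" projection: score a hidden state against every
# vocabulary item via the unembedding matrix (not the VISIT tool's API).
import torch

def project_to_vocab(hidden: torch.Tensor, unembed: torch.Tensor, k: int = 5) -> torch.Tensor:
    """hidden: (d_model,); unembed: (vocab_size, d_model). Returns top-k token ids."""
    logits = unembed @ hidden
    return torch.topk(logits, k).indices
```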
arXiv Detail & Related papers (2023-05-22T19:04:56Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
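One plausible, purely illustrative reading of "model the correspondences between prototypes and visual features" is a cross-attention layer in which learned semantic prototypes query the image's patch features; this is an assumption, not the actual DSVTM architecture:
```python
# Illustrative cross-attention between learned semantic prototypes and visual
# patch features; an assumption-laden sketch, not the DSVTM module itself.
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    def __init__(self, num_prototypes: int = 64, dim: int = 768, heads: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, dim)
        queries = self.prototypes.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        # Prototypes attend to the image and become instance-centric.
        adapted, _ = self.attn(queries, visual_feats, visual_feats)
        return adapted  # (batch, num_prototypes, dim)
```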
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
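Read loosely, such a concept head amounts to multi-label classification over grid features followed by embedding the top-scoring concepts as extra tokens; the sketch below is a simplified reading of the abstract (a pooled linear classifier stands in for the paper's transformer-based CTN), not ViTCAP's code:
```python
# Schematic concept-token head in the spirit of ViTCAP's CTN (simplified:
# a pooled linear classifier replaces the paper's transformer-based CTN).
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, dim: int = 768, num_concepts: int = 1000, top_k: int = 20):
        super().__init__()
        self.classifier = nn.Linear(dim, num_concepts)
        self.concept_embed = nn.Embedding(num_concepts, dim)
        self.top_k = top_k

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, num_patches, dim) -> multi-label concept scores.
        scores = self.classifier(grid_feats.mean(dim=1))
        top_ids = scores.topk(self.top_k, dim=-1).indices      # (batch, top_k)
        # Embed the predicted concepts so they can be appended to the decoder input.
        return self.concept_embed(top_ids)                     # (batch, top_k, dim)
```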
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
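A rough sketch of such an input-layer change, under the assumption of a learned codebook and a simple additive fusion (both illustrative, not the paper's actual design), might look like this:
```python
# Rough sketch of adding discrete (vector-quantized) tokens to ViT patch
# embeddings; the codebook size and additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DiscreteTokenLayer(nn.Module):
    def __init__(self, dim: int = 768, codebook_size: int = 8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim)
        codebook = self.codebook.weight.unsqueeze(0).expand(patch_embeds.size(0), -1, -1)
        # Nearest codebook entry (Euclidean distance) for every patch.
        codes = torch.cdist(patch_embeds, codebook).argmin(dim=-1)
        # Fuse the discrete representation with the continuous one.
        return patch_embeds + self.codebook(codes)
```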
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties of ViTs via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- Taming Transformers for High-Resolution Image Synthesis [16.86600007830682]
Transformers are designed to learn long-range interactions on sequential data.
They contain no inductive bias that prioritizes local interactions.
This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
We show how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
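At a very high level the recipe is: a convolutional encoder compresses the image into a short sequence of discrete codes, and a transformer then models that code sequence. The toy sketch below illustrates the idea only; layer sizes, the nearest-neighbour quantizer, and the omitted causal mask are placeholders, not the VQGAN configuration:
```python
# Toy sketch of the "CNN compresses, transformer models the codes" recipe;
# sizes and the nearest-neighbour quantizer are placeholders, not VQGAN.
import torch
import torch.nn as nn

class TinyImageCodeModel(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        # Convolutional encoder: local inductive bias plus aggressive downsampling.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        # Transformer: global interactions over the short code sequence.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, codebook_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> feature grid -> nearest-code indices.
        feats = self.encoder(images).flatten(2).transpose(1, 2)   # (batch, seq, dim)
        codebook = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        codes = torch.cdist(feats, codebook).argmin(dim=-1)
        # Predict a code distribution at every position (causal masking omitted).
        return self.to_logits(self.transformer(self.codebook(codes)))
```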
arXiv Detail & Related papers (2020-12-17T18:57:28Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.