Vision Transformers with Natural Language Semantics
- URL: http://arxiv.org/abs/2402.17863v1
- Date: Tue, 27 Feb 2024 19:54:42 GMT
- Title: Vision Transformers with Natural Language Semantics
- Authors: Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro
- Abstract summary: Tokens within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP).
We introduce a novel transformer model, Semantic Vision Transformers (sViT), which harnesses semantic information.
sViT creates an inductive bias reminiscent of convolutional neural networks while capturing the global dependencies characteristic of transformers.
- Score: 13.535916922328287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokens or patches within Vision Transformers (ViT) lack essential semantic
information, unlike their counterparts in natural language processing (NLP).
Typically, ViT tokens are associated with rectangular image patches that lack
specific semantic context, making interpretation difficult and failing to
effectively encapsulate information. We introduce a novel transformer model,
Semantic Vision Transformers (sViT), which leverages recent progress on
segmentation models to design novel tokenizer strategies. sViT effectively
harnesses semantic information, creating an inductive bias reminiscent of
convolutional neural networks while capturing global dependencies and
contextual information within images that are characteristic of transformers.
Through validation using real datasets, sViT demonstrates superiority over ViT,
requiring less training data while maintaining similar or superior performance.
Furthermore, sViT demonstrates significant superiority in out-of-distribution
generalization and robustness to natural distribution shifts, attributed to its
scale-invariant semantic characteristics. Notably, the use of semantic tokens
significantly enhances the model's interpretability. Lastly, the proposed
paradigm facilitates the introduction of new and powerful augmentation
techniques at the token (or segment) level, increasing training data diversity
and generalization capabilities. Just as sentences are made of words, images
are formed by semantic objects; our proposed methodology leverages recent
progress in object segmentation and takes an important and natural step toward
interpretable and robust vision transformers.
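To make the tokenizer idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of how segment-based tokenization could look: it assumes an off-the-shelf segmentation model hidden behind a placeholder segment_image() function, and it simply crops, resizes, and linearly projects each segmented region into one token embedding.
```python
# Hypothetical sketch of segment-based tokenization, not the authors' implementation.
# Assumes an off-the-shelf segmentation model (e.g. a SAM-like model) exposed
# through the placeholder segment_image(), which returns one binary mask per segment.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_image(image: torch.Tensor) -> list[torch.Tensor]:
    """Placeholder: return a list of (H, W) binary masks, one per semantic segment."""
    raise NotImplementedError


class SemanticTokenizer(nn.Module):
    """Turns each segmented region into a single token embedding."""

    def __init__(self, patch_size: int = 32, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(3 * patch_size * patch_size, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W)
        tokens = []
        for mask in segment_image(image):
            ys, xs = torch.nonzero(mask, as_tuple=True)
            y0, y1 = ys.min(), ys.max() + 1
            x0, x1 = xs.min(), xs.max() + 1
            # Zero out the background, crop the bounding box, resize to a fixed size.
            region = (image * mask)[:, y0:y1, x0:x1]
            region = F.interpolate(region[None], size=(self.patch_size, self.patch_size),
                                   mode="bilinear", align_corners=False)[0]
            tokens.append(self.proj(region.flatten()))
        # (num_segments, embed_dim): one token per semantic object, fed to a
        # standard transformer encoder together with positional information.
        return torch.stack(tokens)
```
Because every segment passes through the same crop-resize-project pipeline, this is also a natural place for the segment-level augmentation the abstract mentions: each region could be transformed independently (color jitter, flips, etc.) before projection.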
Related papers
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
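The underlying projection idea (the so-called logit lens) can be written in a couple of lines; the layout below is schematic and not the VISIT tool's actual API:
```python
# Schematic "logit lens" projection: score a hidden state against every
# vocabulary item via the unembedding matrix (not the VISIT tool's API).
import torch

def project_to_vocab(hidden: torch.Tensor, unembed: torch.Tensor, k: int = 5) -> torch.Tensor:
    """hidden: (d_model,); unembed: (vocab_size, d_model). Returns top-k token ids."""
    logits = unembed @ hidden
    return torch.topk(logits, k).indices
```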
arXiv Detail & Related papers (2023-05-22T19:04:56Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
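One plausible, purely illustrative reading of "model the correspondences between prototypes and visual features" is a cross-attention layer in which learned semantic prototypes query the image's patch features; this is an assumption, not the actual DSVTM architecture:
```python
# Illustrative cross-attention between learned semantic prototypes and visual
# patch features; an assumption-laden sketch, not the DSVTM module itself.
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    def __init__(self, num_prototypes: int = 64, dim: int = 768, heads: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, dim)
        queries = self.prototypes.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        # Prototypes attend to the image and become instance-centric.
        adapted, _ = self.attn(queries, visual_feats, visual_feats)
        return adapted  # (batch, num_prototypes, dim)
```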
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
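Read loosely, such a concept head amounts to multi-label classification over grid features followed by embedding the top-scoring concepts as extra tokens; the sketch below is a simplified reading of the abstract (a pooled linear classifier stands in for the paper's transformer-based CTN), not ViTCAP's code:
```python
# Schematic concept-token head in the spirit of ViTCAP's CTN (simplified:
# a pooled linear classifier replaces the paper's transformer-based CTN).
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, dim: int = 768, num_concepts: int = 1000, top_k: int = 20):
        super().__init__()
        self.classifier = nn.Linear(dim, num_concepts)
        self.concept_embed = nn.Embedding(num_concepts, dim)
        self.top_k = top_k

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, num_patches, dim) -> multi-label concept scores.
        scores = self.classifier(grid_feats.mean(dim=1))
        top_ids = scores.topk(self.top_k, dim=-1).indices      # (batch, top_k)
        # Embed the predicted concepts so they can be appended to the decoder input.
        return self.concept_embed(top_ids)                     # (batch, top_k, dim)
```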
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
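A rough sketch of such an input-layer change, under the assumption of a learned codebook and a simple additive fusion (both illustrative, not the paper's actual design), might look like this:
```python
# Rough sketch of adding discrete (vector-quantized) tokens to ViT patch
# embeddings; the codebook size and additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DiscreteTokenLayer(nn.Module):
    def __init__(self, dim: int = 768, codebook_size: int = 8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim)
        codebook = self.codebook.weight.unsqueeze(0).expand(patch_embeds.size(0), -1, -1)
        # Nearest codebook entry (Euclidean distance) for every patch.
        codes = torch.cdist(patch_embeds, codebook).argmin(dim=-1)
        # Fuse the discrete representation with the continuous one.
        return patch_embeds + self.codebook(codes)
```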
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties of ViTs via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- Taming Transformers for High-Resolution Image Synthesis [16.86600007830682]
Transformers are designed to learn long-range interactions on sequential data.
They contain no inductive bias that prioritizes local interactions.
This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
We show how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
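At a very high level the recipe is: a convolutional encoder compresses the image into a short sequence of discrete codes, and a transformer then models that code sequence. The toy sketch below illustrates the idea only; layer sizes, the nearest-neighbour quantizer, and the omitted causal mask are placeholders, not the VQGAN configuration:
```python
# Toy sketch of the "CNN compresses, transformer models the codes" recipe;
# sizes and the nearest-neighbour quantizer are placeholders, not VQGAN.
import torch
import torch.nn as nn

class TinyImageCodeModel(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        # Convolutional encoder: local inductive bias plus aggressive downsampling.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        # Transformer: global interactions over the short code sequence.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, codebook_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> feature grid -> nearest-code indices.
        feats = self.encoder(images).flatten(2).transpose(1, 2)   # (batch, seq, dim)
        codebook = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        codes = torch.cdist(feats, codebook).argmin(dim=-1)
        # Predict a code distribution at every position (causal masking omitted).
        return self.to_logits(self.transformer(self.codebook(codes)))
```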
arXiv Detail & Related papers (2020-12-17T18:57:28Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.