Transfer Learning for Fine-grained Classification Using Semi-supervised
Learning and Visual Transformers
- URL: http://arxiv.org/abs/2305.10018v1
- Date: Wed, 17 May 2023 07:51:35 GMT
- Title: Transfer Learning for Fine-grained Classification Using Semi-supervised
Learning and Visual Transformers
- Authors: Manuel Lagunas, Brayan Impata, Victor Martinez, Virginia Fernandez,
Christos Georgakis, Sofia Braun, Felipe Bertrand
- Abstract summary: Visual transformers (ViT) have emerged as a powerful tool for image classification.
In this work, we explore Semi-ViT, a ViT model fine-tuned using semi-supervised learning techniques.
Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNN) and ViTs, even when fine-tuned with limited annotated data.
- Score: 1.694405932826705
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-grained classification is a challenging task that involves identifying
subtle differences between objects within the same category. This task is
particularly challenging in scenarios where data is scarce. Visual transformers
(ViT) have recently emerged as a powerful tool for image classification, due to
their ability to learn highly expressive representations of visual data using
self-attention mechanisms. In this work, we explore Semi-ViT, a ViT model
fine-tuned using semi-supervised learning techniques, suitable for situations
where annotated data is lacking. This is particularly common in e-commerce,
where images are readily available but labels are noisy, nonexistent, or
expensive to obtain. Our results demonstrate that Semi-ViT outperforms
traditional convolutional neural networks (CNN) and ViTs, even when fine-tuned
with limited annotated data. These findings indicate that Semi-ViTs hold
significant promise for applications that require precise and fine-grained
classification of visual data.
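The abstract does not say which semi-supervised technique is used for fine-tuning. As a generic illustration only, and not the authors' method, the sketch below shows confidence-thresholded pseudo-labeling, one common way to use unlabeled images: the model's confident predictions on unlabeled data are kept as training labels and the rest are discarded. The function names, the 0.9 threshold, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(unlabeled_logits, threshold=0.9):
    """Keep only predictions whose top probability reaches the threshold.

    Returns a list of (sample_index, predicted_class, confidence) tuples.
    The threshold value is an assumption chosen for illustration.
    """
    selected = []
    for i, logits in enumerate(unlabeled_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            selected.append((i, probs.index(conf), conf))
    return selected


# Hypothetical model outputs for three unlabeled images over three classes.
logits = [[4.0, 0.5, 0.2], [1.0, 1.1, 0.9], [0.1, 0.2, 5.0]]
pseudo = select_pseudo_labels(logits, threshold=0.9)
for index, label, conf in pseudo:
    print(f"image {index}: pseudo-label {label} (confidence {conf:.3f})")
```

In a full pipeline, the selected samples would be added to the labeled set (or weighted in the loss) for the next fine-tuning round. The second image is dropped here because its prediction is close to uniform.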
Related papers
- Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks [43.473390101413166]
Self-Supervised Learning for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks.
This study aims to bridge the gap by systematically evaluating the use of unmodified features across image classification and segmentation tasks.
arXiv Detail & Related papers (2025-09-18T11:46:07Z)
- LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning [8.104991333199264]
Vision Transformers (ViTs) have revolutionized computer vision tasks with their exceptional performance.
This work addresses the particularly challenging scenario of random data forgetting in ViTs.
We propose LetheViT, a contrastive unlearning method tailored for ViTs.
arXiv Detail & Related papers (2025-08-03T03:37:31Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional neural networks (ConvNets) for detecting facial deepfake images and videos.
It examines their potential for improved generalization and explainability, especially with limited training data.
By leveraging SSL ViTs for deepfake detection with modest data and partial fine-tuning, we find comparable adaptability to deepfake detection and explainability via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z)
- Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction [17.989559761931435]
We propose a novel "Fine-grained Visual-Semantic Interaction" framework for WSI classification.
It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics.
Our method demonstrates robust generalizability and strong transferability, dominantly outperforming the counterparts on the TCGA Lung Cancer dataset.
arXiv Detail & Related papers (2024-02-29T16:29:53Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Classification of Alzheimers Disease with Deep Learning on Eye-tracking Data [0.7366405857677227]
We investigate whether we can improve on existing results by using a Deep-Learning classifier trained end-to-end on raw ET data.
A main challenge in applying VTNet to our target AD classification task is that the available ET data sequences are much longer than those used in the previous confusion detection task.
We show that VTNet outperforms the state-of-the-art approaches in AD classification, providing encouraging evidence on the generality of this model.
arXiv Detail & Related papers (2023-09-22T02:02:59Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator [21.351034332423374]
We propose a novel ViT based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components, i.e. Attention Patch Combination (APC), Critical Regions Filter (CRF) and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show effective features of ViTs are due to flexible and dynamic receptive fields possible via the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.