Hybrid BYOL-ViT: Efficient approach to deal with small Datasets
- URL: http://arxiv.org/abs/2111.04845v1
- Date: Mon, 8 Nov 2021 21:44:31 GMT
- Title: Hybrid BYOL-ViT: Efficient approach to deal with small Datasets
- Authors: Safwen Naimi, Rien van Leeuwen, Wided Souidene and Slim Ben Saoud
- Abstract summary: In this paper, we investigate how self-supervision with strong and sufficient augmentation of unlabeled data can effectively train the first layers of a neural network.
We show that the low-level features derived from a self-supervised architecture can improve the robustness and the overall performance of this emergent architecture.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised learning can learn large representational spaces, which are
crucial for handling difficult learning tasks. However, due to the design of
the model, classical image classification approaches struggle to generalize to
new problems and new situations when dealing with small datasets. In fact,
supervised learning can lose the location of image features, which leads to
supervision collapse in very deep architectures. In this paper, we investigate
how self-supervision with strong and sufficient augmentation of unlabeled data
can effectively train the first layers of a neural network, even better than
supervised learning, without the need for millions of labeled examples. The main
goal is to disconnect pixel data from annotation by learning generic, task-agnostic
low-level features. Furthermore, we look into Vision Transformers (ViT) and
show that the low-level features derived from a self-supervised architecture
can improve the robustness and the overall performance of this emergent
architecture. We evaluated our method on STL-10, one of the smallest open-source
datasets, and obtained a significant performance boost, from 41.66% to 83.25%,
when feeding the ViT low-level features from a self-supervised learning
architecture instead of raw images.
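The abstract describes the pipeline but ships no code here, so the following is a hypothetical minimal sketch in PyTorch: the first layers of a frozen backbone produce low-level feature maps, and each spatial position of those maps becomes a token for a small ViT. The untrained conv stem stands in for the actual BYOL-pretrained layers, positional embeddings are omitted for brevity, and all dimensions are illustrative.

```python
# Sketch only: frozen "BYOL-like" low-level features fed to a ViT as tokens.
import torch
import torch.nn as nn

class HybridBYOLViT(nn.Module):
    def __init__(self, feat_dim=64, embed_dim=192, num_classes=10,
                 depth=6, num_heads=3):
        super().__init__()
        # Stand-in for the first, BYOL-pretrained layers of a backbone;
        # frozen so the low-level features stay generic and task-agnostic.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1),
        )
        for p in self.backbone.parameters():
            p.requires_grad = False
        # A 1x1 conv maps each spatial position of the feature map to a token.
        self.to_tokens = nn.Conv2d(feat_dim, embed_dim, 1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                   # (B, C, H/4, W/4)
        tokens = self.to_tokens(feats).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])                # classify from the CLS token

model = HybridBYOLViT()
print(model(torch.randn(2, 3, 96, 96)).shape)  # STL-10 is 96x96 -> (2, 10)
```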
Related papers
- Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation [53.95204595640208]
Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data.
Previous approaches have generated synthetic images at high resolutions without leveraging information from real images.
The proposed method, MUSE, generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features.
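For context, a CAM is simply a classifier-weighted sum over the final convolutional feature maps. The toy sketch below illustrates that computation; the classifier is an invented stand-in, not MUSE's actual teacher network.

```python
# Toy Class Activation Map: weight the conv feature maps by the
# classifier's weights for one class, then ReLU and normalize.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, num_classes)  # applied after global avg pool

    def forward(self, x):
        f = self.features(x)                    # (B, 64, H, W)
        return self.fc(f.mean(dim=(2, 3))), f   # logits and feature maps

def class_activation_map(model, x, cls):
    _, fmaps = model(x)
    w = model.fc.weight[cls]                     # (64,) weights for class `cls`
    cam = torch.einsum('c,bchw->bhw', w, fmaps)  # weighted sum over channels
    cam = F.relu(cam)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # scale to [0, 1]

cam = class_activation_map(ToyClassifier(), torch.randn(1, 3, 32, 32), cls=3)
print(cam.shape)  # torch.Size([1, 32, 32])
```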
arXiv Detail & Related papers (2024-11-26T02:23:31Z)
- SeiT++: Masked Token Modeling Improves Storage-efficient Training [36.95646819348317]
Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks.
However, achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements.
A recent breakthrough, SeiT, proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification.
In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training.
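Masked token modeling over discrete VQ indices works much like masked language modeling. The sketch below uses invented shapes and random integers in place of SeiT++'s actual tokenizer and model, and omits positional embeddings.

```python
# Sketch: mask a fraction of discrete tokens, then train a Transformer
# encoder to recover the original indices at the masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, DIM = 1024, 1024, 196, 256  # extra id for [MASK]

embed = nn.Embedding(VOCAB + 1, DIM)   # +1 embedding slot for [MASK]
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(DIM, VOCAB)           # predict the original token index

tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))      # stand-in VQ indices
mask = torch.rand(8, SEQ_LEN) < 0.4                 # mask 40% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(corrupted)))            # (8, SEQ_LEN, VOCAB)
loss = F.cross_entropy(logits[mask], tokens[mask])  # loss only on masked slots
loss.backward()
```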
arXiv Detail & Related papers (2023-12-15T04:11:34Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
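As a hedged illustration of what such a baseline can look like, the sketch below scores training examples by cosine similarity to each test example in a frozen self-supervised feature space; the paper's actual scoring may differ.

```python
# Sketch: attribute a prediction to the nearest training examples in a
# frozen feature space (cosine similarity as the influence proxy).
import torch
import torch.nn.functional as F

def attribute(test_feats, train_feats, k=5):
    """Return indices of the k most similar training examples per test item."""
    test_feats = F.normalize(test_feats, dim=-1)
    train_feats = F.normalize(train_feats, dim=-1)
    sims = test_feats @ train_feats.T      # (n_test, n_train) cosine sims
    return sims.topk(k, dim=-1).indices   # top-k training indices each

train_feats = torch.randn(1000, 384)  # e.g. from a frozen SSL backbone
test_feats = torch.randn(4, 384)
print(attribute(test_feats, train_feats).shape)  # torch.Size([4, 5])
```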
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
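DQ's own pipeline is more involved, but the flavor of compressing a dataset into a small, representative subset can be shown with a generic k-center-greedy picker over feature vectors. This is explicitly not the DQ algorithm, just a common subset-selection sketch.

```python
# Generic coreset selection: greedily pick points that cover feature space.
import torch

def k_center_greedy(feats, budget):
    """Pick `budget` indices so every point is near some selected point."""
    n = feats.size(0)
    selected = [torch.randint(n, (1,)).item()]       # random seed point
    dists = torch.cdist(feats, feats[selected]).squeeze(1)
    for _ in range(budget - 1):
        nxt = dists.argmax().item()                  # farthest from chosen set
        selected.append(nxt)
        dists = torch.minimum(
            dists, torch.cdist(feats, feats[nxt:nxt + 1]).squeeze(1))
    return selected

subset = k_center_greedy(torch.randn(500, 128), budget=50)
print(len(subset))  # 50 indices into the original dataset
```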
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Let Segment Anything Help Image Dehaze [12.163299570927302]
We propose a framework to integrate a large-model prior into low-level computer vision tasks.
We demonstrate the effectiveness and applicability of large models in guiding low-level visual tasks.
arXiv Detail & Related papers (2023-06-28T02:02:19Z)
- Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation [13.056764072568749]
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) on smaller datasets.
We propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs.
Our proposed module is memory- and computation-efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens.
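One common way to inject such a local inductive bias, sketched below with invented dimensions rather than the actual LIFE design, is a depthwise convolution over the 2D patch grid whose output is added back to the patch embeddings while auxiliary tokens pass through untouched.

```python
# Sketch: local mixing over the patch grid, residual-added to embeddings.
import torch
import torch.nn as nn

class LocalEnhancer(nn.Module):
    def __init__(self, dim, grid):
        super().__init__()
        self.grid = grid  # patches per side, e.g. 14 for 224/16
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, tokens):                # (B, 1 + grid*grid, D), CLS first
        cls, patches = tokens[:, :1], tokens[:, 1:]
        b, n, d = patches.shape
        x = patches.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = self.dwconv(x).flatten(2).transpose(1, 2)   # per-patch local info
        return torch.cat([cls, patches + x], dim=1)     # keep CLS untouched

tokens = torch.randn(2, 1 + 14 * 14, 192)
print(LocalEnhancer(192, grid=14)(tokens).shape)  # torch.Size([2, 197, 192])
```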
arXiv Detail & Related papers (2023-05-15T11:23:18Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- Transformer-Based Behavioral Representation Learning Enables Transfer Learning for Mobile Sensing in Small Datasets [4.276883061502341]
We provide a neural architecture framework for mobile sensing data that can learn generalizable feature representations from time series.
This architecture combines benefits from CNN and Transformer architectures to enable better prediction performance.
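A generic version of such a hybrid, with invented layer sizes rather than the paper's architecture, pairs a 1D convolutional front end (local temporal patterns, downsampling) with a Transformer encoder (long-range dependencies).

```python
# Sketch: 1D convs tokenize a multivariate time series for a Transformer.
import torch
import torch.nn as nn

class ConvTransformer(nn.Module):
    def __init__(self, channels=6, dim=64, num_classes=2):
        super().__init__()
        # 1D convs capture local temporal patterns and downsample 4x.
        self.conv = nn.Sequential(
            nn.Conv1d(channels, dim, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 5, stride=2, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                 # x: (B, channels, timesteps)
        h = self.conv(x).transpose(1, 2)  # (B, T/4, dim) tokens for attention
        return self.head(self.encoder(h).mean(dim=1))  # mean-pool over time

model = ConvTransformer()
print(model(torch.randn(4, 6, 256)).shape)  # torch.Size([4, 2])
```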
arXiv Detail & Related papers (2021-07-09T22:26:50Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Learning Visual Representations for Transfer Learning by Suppressing Texture [38.901410057407766]
In self-supervised learning, texture as a low-level cue may provide shortcuts that prevent the network from learning higher level representations.
We propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture.
We empirically show that our method achieves state-of-the-art results on object detection and image classification.
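Anisotropic diffusion itself is classical. A minimal NumPy sketch of Perona-Malik smoothing (the paper's exact variant and parameters may differ) shows the mechanism: conduction is suppressed across strong edges, so fine texture is smoothed away while object boundaries survive.

```python
# Perona-Malik anisotropic diffusion for texture suppression (sketch).
import numpy as np

def anisotropic_diffusion(img, n_iter=20, kappa=30.0, step=0.2):
    img = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # Finite differences toward the four neighbors (wrap-around
        # boundaries via np.roll, for simplicity).
        dn = np.roll(img, -1, axis=0) - img
        ds = np.roll(img, 1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # Conduction: near zero across strong edges, large in flat regions.
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        img += step * (cn * dn + cs * ds + ce * de + cw * dw)
    return img

smoothed = anisotropic_diffusion(np.random.rand(64, 64) * 255)
print(smoothed.shape)  # (64, 64): texture suppressed, edges preserved
```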
arXiv Detail & Related papers (2020-11-03T18:27:03Z)
- Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation [57.68890534164427]
In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
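The recipe is compact enough to sketch. Below, a runnable toy with scikit-learn logistic regression on synthetic data stands in for the segmentation networks and video frames the paper actually uses; only the iterative train-pseudo-label-retrain loop is the point.

```python
# Toy Naive-Student loop: teacher on human labels, then iterate a student
# trained on human labels plus pseudo-labels for the unlabeled pool.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(100, 8))
y_lab = (X_lab[:, 0] > 0).astype(int)        # synthetic "human" labels
X_unlab = rng.normal(size=(1000, 8))         # e.g. unlabeled video frames

model = LogisticRegression().fit(X_lab, y_lab)      # teacher
for _ in range(3):                                  # iterative self-training
    pseudo = model.predict(X_unlab)                 # pseudo-label the pool
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    model = LogisticRegression().fit(X_all, y_all)  # student on both sets
print(model.score(X_lab, y_lab))
```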
arXiv Detail & Related papers (2020-05-20T18:00:05Z)