Multi-Label Plant Species Classification with Self-Supervised Vision Transformers
- URL: http://arxiv.org/abs/2407.06298v1
- Date: Mon, 8 Jul 2024 18:07:33 GMT
- Title: Multi-Label Plant Species Classification with Self-Supervised Vision Transformers
- Authors: Murilo Gustineli, Anthony Miyaguchi, Ian Stalter
- Abstract summary: We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition.
To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing.
Our results demonstrate the efficacy of combining transfer learning with advanced data processing techniques for multi-label image classification tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition, focusing on multi-label plant species classification. Our method leverages both base and fine-tuned DINOv2 models to extract generalized feature embeddings. We train classifiers to predict multiple plant species within a single image using these rich embeddings. To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing, ensuring efficient memory management and processing across a cluster of workers. Our data processing pipeline transforms images into grids of tiles, classifies each tile, and aggregates these predictions into a consolidated set of probabilities. Our results demonstrate the efficacy of combining transfer learning with advanced data processing techniques for multi-label image classification tasks. Our code is available at https://github.com/dsgt-kaggle-clef/plantclef-2024.
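A minimal sketch of the tile-and-aggregate inference step described in the abstract, assuming the DINOv2 ViT-S/14 backbone from torch.hub; the grid size, the 7,806-class linear head, and max-pooling over tiles are illustrative assumptions, not the authors' released code (see their repository for the actual pipeline).

```python
import torch

# Base DINOv2 backbone (ViT-S/14) from torch.hub; the entry also uses a
# fine-tuned variant, which this sketch omits.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

NUM_SPECIES = 7806              # PlantCLEF 2024 species count (illustrative)
classifier = torch.nn.Linear(384, NUM_SPECIES)  # stand-in for the trained head

def predict_image(image: torch.Tensor, grid: int = 3) -> torch.Tensor:
    """Split one image (3, H, W) into a grid of tiles, classify each tile
    on its DINOv2 CLS embedding, and aggregate tile outputs into a single
    multi-label probability vector (max over tiles, an assumption)."""
    _, h, w = image.shape
    th, tw = h // grid, w // grid
    tile_probs = []
    with torch.no_grad():
        for i in range(grid):
            for j in range(grid):
                tile = image[:, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
                # Resize so the side length is divisible by the 14px patch size.
                tile = torch.nn.functional.interpolate(
                    tile[None], size=(518, 518), mode="bilinear")
                emb = backbone(tile)                    # (1, 384) CLS features
                tile_probs.append(classifier(emb).sigmoid())
    # A species is kept if any tile detects it.
    return torch.stack(tile_probs).amax(dim=0).squeeze(0)   # (NUM_SPECIES,)
```

In the paper's pipeline this per-image step runs under Spark across a cluster of workers; the max-pooling aggregation shown here is only one plausible way to consolidate tile predictions into image-level probabilities.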
Related papers
- CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion [0.0]
We present a novel lightweight hybrid network that pairs Convolution with Transformers.
We fuse the local responses acquired from the convolution path with the global responses acquired from the MFCA module.
Experiments demonstrate that our variants achieve state-of-the-art performance, whether trained from scratch on large datasets or in a low-data regime.
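A toy sketch of the conv/Transformer pairing idea: local convolutional responses cross-attend to global self-attention responses before fusion. The dimensions and residual-sum fusion are assumptions; the paper's actual MFCA module differs.

```python
import torch
import torch.nn as nn

class ConvTransformerFusion(nn.Module):
    """Toy fusion block: features from a local conv path are fused with
    features from a global attention path via cross-attention, in the
    spirit of the MFCA module (all sizes here are illustrative)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)        # local path
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        tokens = x.flatten(2).transpose(1, 2)                # (B, HW, C)
        global_resp, _ = self.global_attn(tokens, tokens, tokens)
        # Local responses query the global responses; the two are then summed.
        fused, _ = self.cross_attn(local, global_resp, global_resp)
        return (local + fused).transpose(1, 2).reshape(b, c, h, w)

block = ConvTransformerFusion()
out = block(torch.randn(2, 256, 14, 14))   # -> (2, 256, 14, 14)
```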
arXiv Detail & Related papers (2024-07-09T08:47:13Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
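The general shape of inter-class augmentation, sketched with a placeholder for the diffusion-based image translation Diff-Mix actually performs; `translate` and the linear label-softening scheme are hypothetical stand-ins, not the paper's method.

```python
import torch
import torch.nn.functional as F

def diffmix_batch(images, labels, translate, num_classes, strength=0.7):
    """Inter-class augmentation in outline: translate each image toward a
    randomly drawn class and soften its label accordingly. `translate`
    is a hypothetical stub for a diffusion-based image translation."""
    targets = torch.randint(num_classes, labels.shape)  # may equal the source
    mixed = translate(images, targets, strength)        # diffusion edit (stub)
    src = F.one_hot(labels, num_classes).float()
    tgt = F.one_hot(targets, num_classes).float()
    soft_labels = (1.0 - strength) * src + strength * tgt
    return mixed, soft_labels
```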
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Convolutional autoencoder-based multimodal one-class classification [80.52334952912808]
One-class classification refers to approaches that learn from data of a single class only.
We propose a deep learning one-class classification method suitable for multimodal data.
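The single-modality core of the idea in a minimal sketch: train a convolutional autoencoder on the target class only and score test inputs by reconstruction error. The paper's multimodal extension is omitted, and the architecture is illustrative.

```python
import torch
import torch.nn as nn

class OneClassAE(nn.Module):
    """Minimal convolutional autoencoder for one-class classification:
    train on in-class data only; at test time, inputs whose reconstruction
    error exceeds a threshold are flagged as out-of-class."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model, x):
    # Per-sample mean squared reconstruction error; higher = more anomalous.
    return ((model(x) - x) ** 2).mean(dim=(1, 2, 3))
```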
arXiv Detail & Related papers (2023-09-25T12:31:18Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
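A simplified sketch of content-based token clustering: a few k-means steps over the key tokens, then attention against the cluster centroids, cutting cost from O(N^2) to O(NM). The paper's actual clustering procedure differs; this shows the general mechanism.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, num_clusters=16, iters=5):
    """Cluster the key tokens with a few k-means steps, average keys and
    values within each cluster, then attend over the (much shorter)
    centroid sequence. q, k, v: (B, N, D)."""
    b, n, d = k.shape
    centroids = k[:, torch.randperm(n)[:num_clusters]]       # init from keys
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(-1)        # (B, N)
        onehot = F.one_hot(assign, num_clusters).float()     # (B, N, M)
        counts = onehot.sum(1).clamp(min=1)                  # (B, M)
        centroids = onehot.transpose(1, 2) @ k / counts[..., None]
    v_clustered = onehot.transpose(1, 2) @ v / counts[..., None]
    attn = F.softmax(q @ centroids.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_clustered     # (B, N, D): cost O(N*M) instead of O(N^2)
```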
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Memory Classifiers: Two-stage Classification for Robustness in Machine Learning [19.450529711560964]
We propose a new method for classification which can improve robustness to distribution shifts.
We combine expert knowledge about the "high-level" structure of the data with standard classifiers.
We show improvements that push beyond standard data augmentation techniques.
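One way to read the two-stage recipe, as a sketch: a hand-written grouping rule routes samples (the expert knowledge), and an ordinary classifier is fit per group. The grouping function and per-group model choice are assumptions about the method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class MemoryClassifier:
    """Two-stage sketch: an expert-provided `group_fn` routes each sample
    to a high-level group (stage 1), and a separate standard classifier
    is fit within each group (stage 2). Assumes every group seen at test
    time was populated during training."""
    def __init__(self, group_fn, n_groups):
        self.group_fn = group_fn
        self.models = {g: LogisticRegression(max_iter=1000)
                       for g in range(n_groups)}

    def fit(self, X, y):
        groups = self.group_fn(X)
        for gid, model in self.models.items():
            mask = groups == gid
            if mask.any():
                model.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        groups = self.group_fn(X)
        out = np.empty(len(X), dtype=int)
        for gid, model in self.models.items():
            mask = groups == gid
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out
```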
arXiv Detail & Related papers (2022-06-10T18:44:45Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose ViTAE, a Vision Transformer advanced by exploring intrinsic inductive bias (IB) from convolutions.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
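A rough sketch of a spatial-pyramid reduction cell: parallel dilated convolutions capture multi-scale context while downsampling the image into tokens. Channel sizes and dilation rates are illustrative, not ViTAE's exact design.

```python
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    """Parallel convolutions with different dilation rates gather
    multi-scale context; their concatenated responses are strided down
    and flattened into a token sequence."""
    def __init__(self, in_ch=3, dim=64, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=r, dilation=r)
            for r in rates)
        self.proj = nn.Conv2d(dim * len(rates), dim, 1)

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.proj(multi).flatten(2).transpose(1, 2)   # (B, N, dim)

tokens = PyramidReduction()(torch.randn(2, 3, 224, 224))     # (2, 12544, 64)
```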
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Feature transforms for image data augmentation [74.12025519234153]
In image classification, many augmentation approaches utilize simple image manipulation algorithms.
In this work, we build ensembles on the data level by adding images generated by combining fourteen augmentation approaches.
Pretrained ResNet50 networks are finetuned on training sets that include images derived from each augmentation method.
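The data-level ensemble recipe in outline: fine-tune one pretrained ResNet50 per augmentation method and average the members' softmax outputs at test time. Two stand-in augmentations are shown (the paper combines fourteen), and the training loop is elided.

```python
import torch
from torchvision import models, transforms

# One augmentation per ensemble member; simple stand-ins for brevity.
augmentations = {
    "hflip": transforms.RandomHorizontalFlip(p=1.0),
    "jitter": transforms.ColorJitter(0.4, 0.4, 0.4),
}

def finetune(augment, train_loader, num_classes):
    """Fine-tune one pretrained ResNet50 on a training set extended with
    images from a single augmentation method (training loop elided)."""
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    ...  # standard supervised fine-tuning over original + augmented images
    return model

def ensemble_predict(members, x):
    # Data-level ensemble: average the softmax outputs of the members.
    with torch.no_grad():
        probs = [m(x).softmax(dim=-1) for m in members]
    return torch.stack(probs).mean(dim=0)
```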
arXiv Detail & Related papers (2022-01-24T14:12:29Z)
- Multi-Domain Few-Shot Learning and Dataset for Agricultural Applications [0.0]
We propose a method to learn from a few samples to automatically classify different pests, plants, and their diseases.
We learn a feature extractor to generate embeddings and then update the embeddings using Transformers.
We conduct 42 experiments in total to comprehensively analyze the model, which achieves performance gains of up to 14% and 24% on few-shot image classification benchmarks.
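A compact sketch of the extract-then-refine idea: frozen-backbone embeddings for support and query images are jointly refined by a Transformer encoder, and queries are scored against class prototypes. The prototype-distance head is an assumption about how classification is done.

```python
import torch
import torch.nn as nn

refine = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=4, batch_first=True),
    num_layers=1)

def fewshot_logits(encoder, support, support_y, query, n_way):
    """Embed support and query images with a frozen feature extractor
    (`encoder`, assumed to emit 384-d vectors), refine all embeddings
    jointly with a Transformer, then score queries by negative distance
    to each class prototype."""
    emb = encoder(torch.cat([support, query]))     # (S+Q, 384)
    emb = refine(emb.unsqueeze(0)).squeeze(0)      # joint refinement pass
    s_emb, q_emb = emb[: len(support)], emb[len(support):]
    protos = torch.stack(
        [s_emb[support_y == c].mean(0) for c in range(n_way)])
    return -torch.cdist(q_emb, protos)             # (Q, n_way), higher = closer
```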
arXiv Detail & Related papers (2021-09-21T04:20:18Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
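One plausible reading of a pixel-to-prototype contrastive loss, as a sketch: prototypes are per-class mean pixel embeddings pooled across a (possibly multi-dataset) batch, and each pixel is scored against them with an InfoNCE-style objective. The temperature and pooling details are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pixel_to_prototype_loss(pix_emb, labels, num_classes, tau=0.1):
    """pix_emb: (B, H, W, D) pixel embeddings; labels: (B, H, W) class ids.
    Each pixel is pulled toward its class prototype and pushed away from
    the other prototypes via a cross-entropy over similarities."""
    pix = F.normalize(pix_emb.flatten(0, -2), dim=-1)        # (P, D)
    lab = labels.flatten()                                   # (P,)
    protos = torch.stack([
        F.normalize(pix[lab == c].mean(0), dim=-1) if (lab == c).any()
        else torch.zeros(pix.shape[-1])                      # absent class
        for c in range(num_classes)])                        # (C, D)
    logits = pix @ protos.T / tau                            # (P, C)
    return F.cross_entropy(logits, lab)
```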
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear time in both computation and memory, rather than the quadratic cost of full self-attention.
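A minimal sketch of the branch-fusion step: the CLS token of one branch is the sole query into the other branch's patch tokens, which is why the cross-attention is linear rather than quadratic in the token count. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BranchCrossAttention(nn.Module):
    """CrossViT-style fusion: the CLS token of one branch attends to the
    patch tokens of the other branch. With a single query token, cost is
    linear in the number of tokens."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a: (B, 1, D) query; tokens_b: (B, N, D) keys/values.
        fused, _ = self.attn(cls_a, tokens_b, tokens_b)
        return cls_a + fused           # residual update of the CLS token

xattn = BranchCrossAttention()
cls_small = torch.randn(2, 1, 256)     # CLS from the small-patch branch
tok_large = torch.randn(2, 196, 256)   # patch tokens from the large-patch branch
out = xattn(cls_small, tok_large)      # (2, 1, 256)
```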
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- Variational Clustering: Leveraging Variational Autoencoders for Image Clustering [8.465172258675763]
Variational Autoencoders (VAEs) naturally lend themselves to learning data distributions in a latent space.
We propose a method based on VAEs where we use a Gaussian Mixture prior to help cluster the images accurately.
Our method simultaneously learns a prior that captures the latent distribution of the images and a posterior to help discriminate well between data points.
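A heavily simplified sketch of the idea: a VAE with a learnable Gaussian-mixture prior, where component responsibilities serve as cluster assignments. The real method optimizes a full mixture ELBO; the unit-variance responsibility shortcut below is an assumption for brevity.

```python
import torch
import torch.nn as nn

class GMMPriorVAE(nn.Module):
    """VAE whose prior is a learnable Gaussian mixture: the encoder places
    each image near a mixture component, and the component responsibility
    doubles as a soft cluster assignment."""
    def __init__(self, in_dim=784, latent=16, k=10):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent)            # -> mu, logvar
        self.dec = nn.Linear(latent, in_dim)
        self.mix_mu = nn.Parameter(torch.randn(k, latent))  # component means

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = torch.sigmoid(self.dec(z))
        # Responsibilities of unit-variance components (simplification).
        logp = -0.5 * torch.cdist(z, self.mix_mu) ** 2       # (B, K)
        return recon, mu, logvar, logp.softmax(dim=-1)

def cluster(model, x):
    return model(x)[-1].argmax(dim=-1)   # hard cluster label per image
```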
arXiv Detail & Related papers (2020-05-10T09:34:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.