Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
- URL: http://arxiv.org/abs/2508.10457v1
- Date: Thu, 14 Aug 2025 08:56:58 GMT
- Title: Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
- Authors: Hanna Herasimchyk, Robin Labryga, Tomislav Prusina
- Abstract summary: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
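The multi-scale tiling and threshold-based inference described in the abstract can be sketched as follows; the tile sizes, threshold grid, and all function names here are illustrative assumptions, not the authors' released code (see the linked repository for the actual implementation).

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int, stride: int):
    """Split an H x W x C image into overlapping square tiles."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, max(h - tile_size, 0) + 1, stride):
        for x in range(0, max(w - tile_size, 0) + 1, stride):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return tiles

def multi_scale_tiles(image: np.ndarray, tile_sizes=(256, 512, 1024)):
    """Collect tiles at several scales so plants of different sizes are covered."""
    all_tiles = []
    for size in tile_sizes:
        all_tiles.extend(tile_image(image, size, stride=size // 2))
    return all_tiles

def threshold_predictions(logits: np.ndarray, threshold: float):
    """Max-pool per-tile logits over tiles, then keep species above threshold."""
    pooled = logits.max(axis=0)  # (num_species,)
    return np.flatnonzero(pooled >= threshold)

def tune_threshold(all_logits, target_mean_labels, grid=np.linspace(-5, 5, 101)):
    """Dynamic thresholding: pick the cutoff whose mean prediction count
    per image best matches a target mean prediction length."""
    best_t, best_gap = grid[0], float("inf")
    for t in grid:
        mean_len = np.mean([len(threshold_predictions(l, t)) for l in all_logits])
        gap = abs(mean_len - target_mean_labels)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

Max-pooling over tiles is one plausible aggregation choice; the paper's ensemble (bagging, Hydra heads) and top-n filtering would sit on top of logits produced this way.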
Related papers
- Zero-Shot Segmentation through Prototype-Guidance for Multi-Label Plant Species Identification [0.5249805590164902]
This paper presents an approach developed to address the PlantCLEF 2025 challenge, which consists of fine-grained multi-label species identification. Our solution employs class prototypes obtained from the training dataset as proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. The proposed approach enabled domain adaptation from multi-class identification of individual species to multi-label classification of high-resolution vegetation plots.
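A minimal sketch of the class-prototype idea this summary describes, under the assumption that prototypes are mean-pooled normalized training embeddings and test patches are assigned by cosine similarity; the names and shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def class_prototypes(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mean-pool L2-normalized embeddings per class: one prototype per class."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels)
    return np.stack([normed[labels == c].mean(axis=0) for c in classes])

def assign_patches(patch_embeddings: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Label each test patch with the class whose prototype is most cosine-similar."""
    p = patch_embeddings / np.linalg.norm(patch_embeddings, axis=1, keepdims=True)
    q = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (p @ q.T).argmax(axis=1)
```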
arXiv Detail & Related papers (2025-12-23T01:06:55Z)
- Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images [2.7110107174608173]
The PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. It provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The aim is to predict all the plant species present on a high-resolution plot image.
arXiv Detail & Related papers (2025-09-19T08:51:41Z)
- Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification [0.0]
This paper presents our approach for the FungiCLEF 2025 competition. It focuses on few-shot fine-grained visual categorization using the FungiTastic Few-Shot dataset.
arXiv Detail & Related papers (2025-07-11T01:21:21Z)
- Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features [1.5495593104596397]
We train a model to predict the outcomes of 4,716 plant surveys in Europe. We build a network based on the Swin-Transformer Block backbone for extracting temporal cube features. We then design a hierarchical cross-attention mechanism capable of fusing features from multiple modalities.
arXiv Detail & Related papers (2025-01-05T20:30:07Z)
- Multi-Label Plant Species Classification with Self-Supervised Vision Transformers [0.0]
We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition.
To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing.
Our results demonstrate the efficacy of combining transfer learning with advanced data processing techniques for multi-label image classification tasks.
arXiv Detail & Related papers (2024-07-08T18:07:33Z)
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. This paper explores the differences across various CLIP-trained vision backbones. The proposed method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone.
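One simple way to realize the backbone-ensembling idea summarized above is to combine per-backbone class logits with softmax-normalized adaptive weights; this sketch is an illustrative assumption, not the paper's actual method.

```python
import numpy as np

def ensemble_logits(backbone_logits, weights):
    """Combine per-backbone class logits (each of shape (num_classes,))
    using softmax-normalized weights, so relative weights can be adapted
    per task while the combination stays a convex mixture."""
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    stacked = np.stack(backbone_logits)  # (num_backbones, num_classes)
    return (w[:, None] * stacked).sum(axis=0)
```

With all weights equal, this reduces to plain logit averaging; tuning the weights on validation data is the "adaptive" part.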
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Effective Data Augmentation With Diffusion Models [45.18188726287581]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks and on a real-world weed recognition task, and observe an improvement in accuracy in the tested domains.
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
- Conviformers: Convolutionally guided Vision Transformer [5.964436882344729]
We present an in-depth analysis and describe the critical components for developing a system for the fine-grained categorization of plants from herbarium sheets.
We introduce a convolutional transformer architecture called Conviformer which, unlike the popular Vision Transformer (ViT), can handle higher-resolution images without exploding memory and computational cost.
With our simple yet effective approach, we achieved SoTA on the Herbarium 202x and iNaturalist 2019 datasets.
arXiv Detail & Related papers (2022-08-17T13:09:24Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
- Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z)
- Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.