Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study
- URL: http://arxiv.org/abs/2602.03894v1
- Date: Tue, 03 Feb 2026 08:27:22 GMT
- Title: Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study
- Authors: Hugo Markoff, Stefan Hein Bengtson, Michael Ørsted
- Abstract summary: Manual labeling of animal images remains a significant bottleneck in ecological research. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters.
- Score: 0.19116784879310023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms (two supervised and two unsupervised) across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at the species level, where it fails, and whether clustering within species reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed species distributions and show that intentional over-clustering can reliably extract intra-specific variation, including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations to help ecologists select appropriate methods for sorting their specific taxonomic groups and data.
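The abstract's best-performing pipeline (pretrained ViT embeddings → t-SNE → hierarchical clustering, scored with V-measure) can be sketched with scikit-learn. This is a minimal illustration, not the paper's toolkit: random Gaussian vectors stand in for DINOv3 features, the species count and dimensions are arbitrary, and "supervised" here means only that the number of clusters is given.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

# Stand-in for ViT embeddings: 3 "species", 200 images each, 768-dim features.
# In the paper these would come from a frozen DINOv3 backbone.
n_species, n_per, dim = 3, 200, 768
centers = rng.normal(scale=5.0, size=(n_species, dim))
X = np.vstack([c + rng.normal(size=(n_per, dim)) for c in centers])
y_true = np.repeat(np.arange(n_species), n_per)

# Reduce the embeddings to 2-D with t-SNE before clustering.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

# Hierarchical (agglomerative) clustering with the species count supplied.
labels = AgglomerativeClustering(n_clusters=n_species).fit_predict(X_2d)

# V-measure: harmonic mean of homogeneity and completeness,
# invariant to how cluster labels are permuted.
print(f"V-measure: {v_measure_score(y_true, labels):.3f}")
```

Swapping `AgglomerativeClustering` for a density-based method such as `sklearn.cluster.HDBSCAN` would mimic the unsupervised variant, which needs no cluster count and can flag outliers for expert review.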
Related papers
- Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering [0.0]
Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers.
arXiv Detail & Related papers (2025-10-16T11:59:18Z)
- Overview of GeoLifeCLEF 2023: Species Composition Prediction with High Spatial Resolution at Continental Scale Using Remote Sensing [9.66382598562254]
We organized an open machine learning challenge called GeoLifeCLEF 2023. The training dataset comprised 5 million plant species occurrences distributed across Europe. We evaluated models' ability to predict species in 22 thousand small plots based on standardized surveys.
arXiv Detail & Related papers (2025-09-30T05:49:16Z)
- BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning [60.80381372245902]
We find emergent behaviors in biological vision models via large-scale contrastive vision-language training. We train BioCLIP 2 on TreeOfLife-200M to distinguish different species. We identify emergent properties in the learned embedding space of BioCLIP 2.
arXiv Detail & Related papers (2025-05-29T17:48:20Z)
- BeetleVerse: A Study on Taxonomic Classification of Ground Beetles [0.310688583550805]
Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets. Our results show that the Vision and Language Transformer combined with a classification head is the best performing model, with 97% accuracy at genus and species level.
arXiv Detail & Related papers (2025-04-18T01:06:37Z)
- Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition [30.327642724046903]
Species196 is a large-scale semi-supervised dataset of 196-category invasive species.
It collects over 19K images with expert-level accurate annotations (Species196-L) and 1.2M unlabeled images of invasive species (Species196-U).
arXiv Detail & Related papers (2023-09-25T14:46:01Z)
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical ranges of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z)
- Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective [51.70661197256033]
We propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation.
We first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks.
We experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings.
arXiv Detail & Related papers (2023-02-03T13:50:25Z)
- Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders [50.689585476660554]
We propose a new fine-tuning strategy that includes positive-pair loss relaxation and random sentence sampling.
Our approach consistently improves overall zero-shot pathology classification across four chest X-ray datasets and three pre-trained models.
arXiv Detail & Related papers (2022-12-14T06:04:18Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to separate anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Dynamic $\beta$-VAEs for quantifying biodiversity by clustering optically recorded insect signals [0.6091702876917281]
We propose an adaptive variant of the variational autoencoder (VAE) capable of clustering data by phylogenetic groups.
We demonstrate the usefulness of the dynamic $\beta$-VAE on optically recorded insect signals from regions of southern Scandinavia.
arXiv Detail & Related papers (2021-02-10T16:14:13Z)
- Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z)
- Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art ResNet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.