Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering
- URL: http://arxiv.org/abs/2510.14596v1
- Date: Thu, 16 Oct 2025 11:59:18 GMT
- Title: Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering
- Authors: Hugo Markoff, Jevgenijs Galaktionovs
- Abstract summary: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
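The evaluated pipeline (pretrained ViT features, dimensionality reduction, unsupervised clustering, plus continuous 1D t-SNE ordering) can be sketched with scikit-learn. This is a minimal illustration, not the paper's setup: random blobs stand in for DINOv2 embeddings, PCA stands in for UMAP (which would need the third-party umap-learn package), and the number of species is assumed known.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for DINOv2 embeddings of camera-trap crops (real dim: 768).
# Five well-separated synthetic "species" clusters in feature space.
n_per, dim = 60, 768
centers = rng.normal(scale=5.0, size=(5, dim))
features = np.vstack([c + rng.normal(size=(n_per, dim)) for c in centers])

# Dimensionality reduction before clustering (the paper also tests UMAP).
reduced = PCA(n_components=10, random_state=0).fit_transform(features)

# Unsupervised grouping with a Gaussian mixture model (k assumed known).
labels = GaussianMixture(n_components=5, random_state=0).fit_predict(reduced)

# Continuous 1D similarity ordering: project to a single t-SNE dimension
# and sort, so similar images land next to each other in a browsable strip.
coord_1d = TSNE(n_components=1, method="exact", perplexity=30.0,
                random_state=0).fit_transform(reduced).ravel()
order = np.argsort(coord_1d)
```

Sorting images by the 1D t-SNE coordinate yields the "continuous similarity ordering" described in the abstract: a single strip in which visually similar animals sit next to each other, which is what makes manual annotation faster.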
Related papers
- Cross-Camera Cow Identification via Disentangled Representation Learning [1.469246311611757]
Existing animal identification methods excel in controlled, single-camera settings, but face severe challenges in cross-camera generalization. This study proposes a cross-camera cow identification framework based on disentangled representation learning.
arXiv Detail & Related papers (2026-02-07T14:23:35Z) - Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study [0.19116784879310023]
Manual labeling of animal images remains a significant bottleneck in ecological research. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can group thousands of unlabeled animal images directly into species-level clusters.
arXiv Detail & Related papers (2026-02-03T08:27:22Z) - Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception [0.0]
This study investigates the effectiveness of two deep learning architectures, ResNet-101 and Inception v3, for wildlife object detection. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features.
arXiv Detail & Related papers (2025-12-17T14:30:47Z) - Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective [80.10217707456046]
We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata. We train a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags. Our detectors deliver strong generalization to in-the-wild samples and robustness to common benign image perturbations.
arXiv Detail & Related papers (2025-12-05T11:53:18Z) - Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers [0.0]
We present a hierarchical re-classification system for the Animal Detect platform. Our five-stage pipeline is evaluated on a segment of the LILA BC Desert Lion Conservation dataset. After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy.
arXiv Detail & Related papers (2025-10-16T11:57:07Z) - Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring [0.06282171844772422]
This study developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms, to monitor cow activity seamlessly and accurately.
arXiv Detail & Related papers (2025-08-03T13:36:40Z) - CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification [65.46685389276443]
We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts. We then formulate purification risk as the KL divergence between the joint distributions of the purification process and the attack process. We propose two variants of our CLIPure approach: CLIPure-Diff, which models the likelihood of images' latent vectors, and CLIPure-Cos, which models the likelihood with the cosine similarity between the embeddings of an image and the prompt "a photo of a."
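Zero-shot classification by prompt matching, which the CLIP-based entries here rely on, reduces to a cosine-similarity argmax over prompt embeddings. The sketch below uses random vectors as stand-ins for real CLIP image and text embeddings; the species names and the 512-dimensional size are illustrative assumptions, not the models' actual outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    # Cosine similarity between each row of a and each row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Stand-ins for CLIP text embeddings of prompts like "a photo of a {species}".
species = ["lion", "zebra", "elephant"]
text_emb = rng.normal(size=(3, 512))

# Stand-in image embedding, constructed near the "zebra" prompt embedding.
image_emb = text_emb[1] + 0.1 * rng.normal(size=512)

# Zero-shot prediction = the prompt with the highest cosine similarity.
pred = species[int(np.argmax(cosine(image_emb[None, :], text_emb)))]
```

Because no training on the target classes is involved, adding a new species is just adding one more prompt row, which is what makes this style of matching attractive for camera-trap datasets with unseen species.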
arXiv Detail & Related papers (2025-02-25T13:09:34Z) - Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images [57.96659470133514]
Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe.
Supervised learning techniques have been successfully deployed to analyze such imagery; however, training them requires annotations from experts.
Reducing the reliance on costly labelled data has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor.
arXiv Detail & Related papers (2023-11-02T08:32:00Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Distance Estimation and Animal Tracking for Wildlife Camera Trapping [0.0]
We propose a fully automatic approach to estimate camera-to-animal distances.
We leverage state-of-the-art relative monocular depth estimation (MDE) and a novel alignment procedure to estimate metric distances.
We achieve a mean absolute distance estimation error of only 0.9864 meters at a precision of 90.3% and recall of 63.8%.
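A common way to turn relative (unitless) depth into metric distance, in the spirit of the alignment step described above though not necessarily the paper's actual procedure, is a least-squares fit of a global scale and shift against a handful of known reference distances. The sketch below fabricates noiseless reference data purely to illustrate the fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Relative depth predictions from a monocular depth model, plus sparse
# metric reference distances (e.g. from calibration markers in the scene).
rel = rng.uniform(0.1, 1.0, size=20)          # relative depths (unitless)
true_scale, true_shift = 12.0, 1.5            # illustrative ground truth
metric_ref = true_scale * rel + true_shift    # reference distances in meters

# Least-squares fit of a global scale and shift mapping relative -> metric.
A = np.stack([rel, np.ones_like(rel)], axis=1)
(scale, shift), *_ = np.linalg.lstsq(A, metric_ref, rcond=None)

# Apply the fitted alignment to convert relative depths to meters.
metric_pred = scale * rel + shift
mae = float(np.mean(np.abs(metric_pred - metric_ref)))
```

With noiseless references the fit recovers the scale and shift exactly; on real camera-trap frames the same fit would be run against noisy measured distances, and the residual error is what a metric like the 0.9864 m MAE above summarizes.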
arXiv Detail & Related papers (2022-02-09T18:12:18Z) - Detecting Cattle and Elk in the Wild from Space [6.810164473908359]
Localizing and counting large ungulates in satellite imagery is an important task for supporting ecological studies.
We propose a baseline method, CowNet, that simultaneously estimates the number of animals in an image (counts) and predicts their locations at a pixel level (localizes).
We specifically test the temporal generalization of the resulting models over a large landscape in Point Reyes Seashore, CA.
arXiv Detail & Related papers (2021-06-29T14:35:23Z) - Filtering Empty Camera Trap Images in Embedded Systems [0.0]
We present a comparative study on animal recognition models to analyze the trade-off between precision and inference latency on edge devices.
The experiments show that, when using the same set of images for training, detectors achieve superior performance.
Considering the high cost of generating labels for the detection problem, when a massive number of images is labeled for classification, classifiers can reach results comparable to detectors at half the latency.
arXiv Detail & Related papers (2021-04-18T13:56:22Z) - Intra-Inter Camera Similarity for Unsupervised Person Re-Identification [50.85048976506701]
We study a novel intra-inter camera similarity for pseudo-label generation.
We train our re-id model in two stages with intra-camera and inter-camera pseudo-labels, respectively.
This simple intra-inter camera similarity produces surprisingly good performance on multiple datasets.
arXiv Detail & Related papers (2021-03-22T08:29:04Z) - Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.