Recurrence over Video Frames (RoVF) for the Re-identification of Meerkats
- URL: http://arxiv.org/abs/2406.13002v1
- Date: Tue, 18 Jun 2024 18:44:19 GMT
- Title: Recurrence over Video Frames (RoVF) for the Re-identification of Meerkats
- Authors: Mitchell Rogers, Kobe Knowles, Gaël Gendron, Shahrokh Heidari, David Arturo Soriano Valdez, Mihailo Azhar, Padriac O'Leary, Simon Eyre, Michael Witbrock, Patrice Delmas
- Abstract summary: We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip.
We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo.
Our method achieves a top-1 re-identification accuracy of $49\%$, which is higher than that of the best DINOv2 model ($42\%$).
- Score: 4.512615837610558
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning approaches for animal re-identification have had a major impact on conservation, significantly reducing the time required for many downstream tasks, such as well-being monitoring. We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip. RoVF is trained using triplet loss based on the co-occurrence of individuals in the video frames, where the individual IDs are unavailable. We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo. Our method achieves a top-1 re-identification accuracy of $49\%$, which is higher than that of the best DINOv2 model ($42\%$). We found that the model can match observations of individuals where humans cannot, and our model (RoVF) performs better than the comparisons with minimal fine-tuning. In future work, we plan to improve these models by using pre-text tasks, apply them to animal behaviour classification, and perform a hyperparameter search to optimise the models further.
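For readers who want a concrete picture of the two ingredients named in the abstract, a recurrent head that refines a clip embedding frame by frame and a triplet loss over co-occurring individuals, the sketch below is a minimal, hypothetical stand-in in PyTorch. All module choices, dimensions, and names are assumptions for illustration; the paper's actual Perceiver-based head is not reproduced here.

```python
# Hypothetical sketch, NOT the paper's architecture: a latent embedding is
# read from each frame's features via cross-attention (loosely mirroring a
# Perceiver-style read) and updated recurrently with a GRU cell.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentClipHead(nn.Module):
    def __init__(self, feat_dim=768, emb_dim=256, n_heads=4):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(1, emb_dim))
        self.proj = nn.Linear(feat_dim, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.gru = nn.GRUCell(emb_dim, emb_dim)

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        B, T, _ = frame_feats.shape
        tokens = self.proj(frame_feats)             # (B, T, emb_dim)
        h = self.latent.expand(B, -1)               # (B, emb_dim)
        for t in range(T):                          # one recurrence step per frame
            frame = tokens[:, t:t + 1]              # (B, 1, emb_dim)
            read, _ = self.attn(h.unsqueeze(1), frame, frame)
            h = self.gru(read.squeeze(1), h)        # refine the clip embedding
        return F.normalize(h, dim=-1)

# Triplet loss without global IDs: anchor and positive are crops of the same
# individual within a clip; the negative is another individual co-occurring
# in the same frames, as described in the abstract.
head = RecurrentClipHead()
triplet = nn.TripletMarginLoss(margin=0.2)
# loss = triplet(head(anchor), head(positive), head(negative))
```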
Related papers
- PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild [50.656578456979496]
We introduce PriVi, a large-scale primate-centric video pretraining dataset. We pretrain V-JEPA, a large-scale video model, on PriVi to learn primate-specific representations. Results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization.
arXiv Detail & Related papers (2025-11-12T19:27:40Z) - Web-Scale Collection of Video Data for 4D Animal Reconstruction [26.179284343904897]
We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips. Using this pipeline, we amass 30K videos (2M frames), an order of magnitude more than prior works. We present Animal-in-Motion, a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions.
arXiv Detail & Related papers (2025-11-03T02:40:06Z) - Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos [0.2796197251957245]
Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys.
arXiv Detail & Related papers (2025-05-08T22:48:52Z) - Multispecies Animal Re-ID Using a Large Community-Curated Dataset [0.19418036471925312]
We construct a dataset that includes 49 species, 37K individual animals, and 225K images, using this data to train a single embedding network for all species.
Our model consistently outperforms models trained separately on each species, achieving an average gain of 12.5% in top-1 accuracy.
The model is already in production use for 60+ species in a large-scale wildlife monitoring system.
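Since both this entry and the meerkat paper above report top-1 re-identification accuracy, a brief sketch of how that metric is conventionally computed may help: each query embedding is matched to its nearest gallery embedding and the hit rate is averaged over queries. The function below is illustrative and assumes NumPy arrays and cosine similarity; nothing here is taken from either paper's code.

```python
import numpy as np

def top1_reid_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery match shares their identity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                          # cosine similarity (queries x gallery)
    nearest = sims.argmax(axis=1)           # best gallery index per query
    return float(np.mean(gallery_ids[nearest] == query_ids))
```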
arXiv Detail & Related papers (2024-12-07T09:56:33Z) - UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z) - Score-Guided Diffusion for 3D Human Recovery [10.562998991986102]
We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction.
ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model.
We evaluate our approach on three settings/applications: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences.
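As a loose illustration of score guidance in a diffusion model's latent space, the snippet below shows the generic pattern of nudging a latent with the gradient of an image-alignment loss during sampling. Every name here (the denoiser, the alignment loss, the step size) is a hypothetical placeholder; ScoreHMR's actual guidance procedure is more involved.

```python
import torch

def guided_denoising_step(z, denoise_fn, alignment_loss, scale=1.0):
    # Gradient of the alignment loss (e.g. a 2D keypoint reprojection error)
    # with respect to the latent steers sampling toward the image observation.
    z = z.detach().requires_grad_(True)
    loss = alignment_loss(denoise_fn(z))
    grad, = torch.autograd.grad(loss, z)
    return (z - scale * grad).detach()
```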
arXiv Detail & Related papers (2024-03-14T17:56:14Z) - Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval [50.47192086219752]
$\texttt{ABEL}$ is a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings.
By either fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.
arXiv Detail & Related papers (2023-11-27T06:22:57Z) - SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines on the widely used BEIR benchmark.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z) - TempNet: Temporal Attention Towards the Detection of Animal Behaviour in
Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-staged encoder that processes frames spatially first and temporally second (a generic sketch of this pattern follows below).
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
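The spatial-then-temporal pattern mentioned above is a common design, sketched generically below: a per-frame 2D feature extractor followed by a temporal module over the frame sequence. This is not TempNet (its encoder bridge and residual blocks are not reproduced); layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class SpatialThenTemporal(nn.Module):
    """Generic two-staged encoder: per-frame spatial features, then temporal."""
    def __init__(self, emb_dim=128, n_classes=2):
        super().__init__()
        self.spatial = nn.Sequential(                    # stage 1: per-frame CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim))
        self.temporal = nn.GRU(emb_dim, emb_dim, batch_first=True)  # stage 2
        self.head = nn.Linear(emb_dim, n_classes)        # e.g. startle vs. none

    def forward(self, video):                            # (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.spatial(video.flatten(0, 1)).view(B, T, -1)
        _, h = self.temporal(feats)                      # h: (1, B, emb_dim)
        return self.head(h.squeeze(0))                   # per-clip logits
```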
arXiv Detail & Related papers (2022-11-17T23:55:12Z) - Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable
Categories [80.30216777363057]
We introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets.
At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views.
Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.
arXiv Detail & Related papers (2022-11-07T22:42:42Z) - SuperAnimal pretrained pose estimation models for behavioral analysis [42.206265576708255]
Quantification of behavior is critical in applications ranging from neuroscience and veterinary medicine to animal conservation efforts.
We present a series of technical innovations that enable a new method, collectively called SuperAnimal, to develop unified foundation models.
arXiv Detail & Related papers (2022-03-14T18:46:57Z) - You Only Need One Model for Open-domain Question Answering [26.582284346491686]
Recent works for Open-domain Question Answering refer to an external knowledge base using a retriever model.
We propose casting the retriever and the reranker as hard-attention mechanisms applied sequentially within the transformer architecture.
We evaluate our model on Natural Questions and TriviaQA open datasets and our model outperforms the previous state-of-the-art model by 1.0 and 0.7 exact match scores.
arXiv Detail & Related papers (2021-12-14T13:21:11Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
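The token extraction step described here can be illustrated with the common "tubelet" embedding: non-overlapping 3D patches of the video are linearly projected into tokens via a strided 3D convolution. Sizes below are illustrative defaults, not necessarily ViViT's.

```python
import torch
import torch.nn as nn

# A (2, 16, 16) tubelet with matching stride tiles the clip without overlap.
tubelet_embed = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(1, 3, 8, 224, 224)                 # (B, C, T, H, W)
tokens = tubelet_embed(video).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 784, 768]): 4*14*14 spatio-temporal tokens
# These tokens are then encoded by a stack of transformer layers (not shown).
```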
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Fine-Grained Re-Identification [1.8275108630751844]
This paper proposes a computationally efficient fine-grained ReID model, FGReID, which is among the first models to unify image and video ReID.
FGReID takes advantage of video-based pre-training and spatial feature attention to improve performance on both video and image ReID tasks.
arXiv Detail & Related papers (2020-11-26T21:04:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.