From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
- URL: http://arxiv.org/abs/2603.02270v1
- Date: Sat, 28 Feb 2026 21:27:38 GMT
- Title: From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
- Authors: Vasiliy Kudryavtsev, Kirill Borodin, German Berezin, Kirill Bubenchikov, Grach Mkrtchian, Alexander Ryzhkov
- Abstract summary: This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation.
- Score: 35.71275089934349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach uses a gated fusion mechanism and achieves a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
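The abstract names the fusion strategies but not their exact form. As a minimal sketch, the gated fusion idea can be read as a learned per-dimension sigmoid gate blending projected vision and text embeddings; everything below (class name, dimensions, architecture) is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Blend a vision embedding and a text embedding via a learned gate.

    Hypothetical sketch: the paper reports a gated fusion mechanism, but the
    abstract does not specify its architecture. Here the gate is a sigmoid
    layer over the concatenated (projected) embeddings.
    """

    def __init__(self, vision_dim: int, text_dim: int, fused_dim: int = 512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.Sigmoid(),
        )

    def forward(self, vision_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = self.vision_proj(vision_emb)          # (B, fused_dim)
        t = self.text_proj(text_emb)              # (B, fused_dim)
        g = self.gate(torch.cat([v, t], dim=-1))  # per-dimension weights in (0, 1)
        fused = g * v + (1.0 - g) * t             # adaptive blend of the two modalities
        return F.normalize(fused, dim=-1)         # unit-norm embedding for verification

# Example with assumed feature sizes (actual sizes depend on the chosen
# checkpoints): a SigLIP2-Giant image embedding and an E5-Small-v2 text embedding.
fusion = GatedFusion(vision_dim=1536, text_dim=384)
fused = fusion(torch.randn(8, 1536), torch.randn(8, 384))
print(fused.shape)  # torch.Size([8, 512])
```

Simple concatenation, the other end of the ablated spectrum, would amount to returning `torch.cat([v, t], dim=-1)` in place of the gated blend.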
Related papers
- Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench [48.60251555171943]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities. In this work, we quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. We introduce PET-Bench, the first large-scale functional imaging benchmark, comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic performance.
arXiv Detail & Related papers (2026-01-06T05:58:50Z)
- Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling [2.1290878226779877]
We introduce a novel AL Re-ID framework that leverages complementary clustering methods to uncover and target structurally ambiguous regions. We show that our approach consistently outperforms existing foundational, USL and AL baselines. Specifically, we report an average improvement of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods, respectively.
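The summary does not say how the complementary clusterings flag ambiguity. One plausible reading, sketched below purely as an assumption, is to run two clustering methods and query the samples whose co-membership relations disagree most between them (the function name and method choices are hypothetical):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def ambiguous_indices(embeddings: np.ndarray, n_clusters: int, budget: int) -> np.ndarray:
    """Select samples where two complementary clusterings disagree most.

    Hypothetical sketch: a sample counts as 'structurally ambiguous' when the
    set of points it shares a cluster with changes between the two methods.
    Note that the pairwise matrices cost O(n^2) memory.
    """
    a = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    b = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
    same_a = a[:, None] == a[None, :]            # co-membership under clustering A
    same_b = b[:, None] == b[None, :]            # co-membership under clustering B
    disagreement = (same_a != same_b).mean(axis=1)
    return np.argsort(-disagreement)[:budget]    # most ambiguous samples first
```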
arXiv Detail & Related papers (2025-11-10T03:13:40Z)
- Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition [5.45546363077543]
Cattle-CLIP is a multimodal deep learning framework for cattle behaviour recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting.
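The temporal integration module is only named in the summary. A minimal sketch, under the assumption that it attention-pools per-frame CLIP image embeddings into one clip-level embedding (module and dimension choices are hypothetical):

```python
import torch
import torch.nn as nn

class TemporalPool(nn.Module):
    """Aggregate per-frame CLIP embeddings into a single video embedding.

    Hypothetical sketch of a temporal integration module: one self-attention
    layer over the frame sequence, a residual connection, then mean pooling.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, T, dim)
        attended, _ = self.attn(frames, frames, frames)
        return self.norm(frames + attended).mean(dim=1)       # (B, dim)

# Usage: pool 16 frame embeddings, then score against behaviour text
# embeddings by cosine similarity, in the usual CLIP manner.
video_emb = TemporalPool(dim=512)(torch.randn(4, 16, 512))
```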
arXiv Detail & Related papers (2025-10-10T09:43:12Z)
- Denoised Diffusion for Object-Focused Image Augmentation [0.6109833303919141]
We propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios.
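As a rough illustration of the segment-then-composite step (the diffusion-based scene synthesis is out of scope here, so the new background is simply supplied by the caller; all names are hypothetical):

```python
import numpy as np

def composite(animal: np.ndarray, mask: np.ndarray, background: np.ndarray,
              rng: np.random.Generator) -> np.ndarray:
    """Paste a segmented animal onto a new background with a random flip.

    Hypothetical sketch: `animal` is (H, W, 3), `mask` is a binary (H, W)
    segmentation of the animal, and `background` is an (H, W, 3) scene that
    the paper would instead synthesise with a diffusion model.
    """
    if rng.random() < 0.5:                        # random horizontal flip
        animal, mask = animal[:, ::-1], mask[:, ::-1]
    m = mask[..., None].astype(np.float32)        # (H, W, 1) alpha matte
    out = m * animal + (1.0 - m) * background     # alpha-blend the two layers
    return out.astype(background.dtype)
```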
arXiv Detail & Related papers (2025-10-10T03:03:40Z)
- Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection [108.5042835056188]
This work introduces Agent4FaceForgery to address two fundamental problems: how to capture the diverse intents and iterative processes of human forgery creation, and how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media.
arXiv Detail & Related papers (2025-09-16T01:05:01Z)
- AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer [26.738709781346678]
We introduce AniMer+, an extended version of our scalable AniMer framework. A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT). We produce two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds.
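"Family-aware" is not defined in the summary; one speculative reading, sketched here as an assumption only, is a learned per-family token prepended to the ViT patch sequence:

```python
import torch
import torch.nn as nn

class FamilyAwareTokens(nn.Module):
    """Prepend a learned per-family token to ViT patch tokens.

    Hypothetical sketch: one embedding per taxonomic family (quadruped
    families, birds, ...) conditions a shared Vision Transformer so a single
    backbone can serve Mammalia and Aves.
    """

    def __init__(self, n_families: int, dim: int = 768):
        super().__init__()
        self.family_tokens = nn.Embedding(n_families, dim)

    def forward(self, patch_tokens: torch.Tensor, family_id: torch.Tensor) -> torch.Tensor:
        tok = self.family_tokens(family_id).unsqueeze(1)  # (B, 1, dim)
        return torch.cat([tok, patch_tokens], dim=1)      # (B, 1 + N, dim)
```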
arXiv Detail & Related papers (2025-08-01T03:53:03Z)
- A multi-head deep fusion model for recognition of cattle foraging events using sound and movement signals [0.2450783418670958]
This work introduces a deep neural network based on the fusion of acoustic and inertial signals. The main advantage of this model is that it combines the two signals by extracting features automatically and independently from each of them.
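A minimal sketch of such a fusion network, assuming small 1-D convolutional branches that extract features independently from each signal before a shared classification head (channel counts and window lengths are illustrative):

```python
import torch
import torch.nn as nn

def branch(in_channels: int) -> nn.Sequential:
    """Small 1-D CNN that learns features from one raw signal."""
    return nn.Sequential(
        nn.Conv1d(in_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
        nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    )

class MultiHeadFusion(nn.Module):
    """Signal-specific branches whose features are fused for classification.

    Hypothetical sketch: sound and inertial features are extracted
    independently, concatenated, and mapped to foraging-event classes.
    """

    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.audio = branch(1)      # mono acoustic signal
        self.inertial = branch(3)   # 3-axis inertial signal
        self.head = nn.Linear(64 + 64, n_classes)

    def forward(self, sound: torch.Tensor, imu: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([self.audio(sound), self.inertial(imu)], dim=-1))

logits = MultiHeadFusion()(torch.randn(2, 1, 8000), torch.randn(2, 3, 500))
```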
arXiv Detail & Related papers (2025-05-15T11:55:16Z)
- Contrastive Visual Data Augmentation [119.51630737874855]
Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. We propose a Contrastive visual Data Augmentation (CoDA) strategy to help LMMs better align nuanced visual features with language. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data.
arXiv Detail & Related papers (2025-02-24T23:05:31Z)
- A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z)
- Persistent Animal Identification Leveraging Non-Visual Markers [71.14999745312626]
We aim to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time.
This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion.
Our approach achieves 77% accuracy on this animal identification problem, and is able to reject spurious detections when the animals are hidden.
arXiv Detail & Related papers (2021-12-13T17:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.