AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis
- URL: http://arxiv.org/abs/2505.00569v1
- Date: Wed, 30 Apr 2025 12:26:37 GMT
- Title: AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis
- Authors: Enmin Zhong, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García
- Abstract summary: We propose AnimalMotionCLIP to overcome the challenges of integrating motion information and devising an effective temporal modeling scheme. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.
- Score: 45.610770404198874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi-dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.
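The abstract does not spell out implementation details, but the core idea can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' code: RGB frames and optical-flow frames are interleaved along the time axis, each frame is embedded by a CLIP-style image tower, and per-segment classifier scores are aggregated; the dense/semi-dense/sparse schemes are approximated here by a plain sampling stride, purely for illustration.

```python
# Hypothetical sketch of the idea in the abstract (not the authors' code).
import torch
import torch.nn as nn


def interleave_modalities(rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Alternate RGB and flow frames: (T, C, H, W) x 2 -> (2T, C, H, W)."""
    t = min(rgb.shape[0], flow.shape[0])
    stacked = torch.stack([rgb[:t], flow[:t]], dim=1)  # (T, 2, C, H, W)
    return stacked.flatten(0, 1)                       # rgb_0, flow_0, rgb_1, ...


class TemporalAggregator(nn.Module):
    """Shared linear classifier applied per sampled segment, then averaged."""

    def __init__(self, visual_encoder: nn.Module, embed_dim: int,
                 num_classes: int, scheme: str = "sparse"):
        super().__init__()
        self.visual_encoder = visual_encoder            # e.g. a frozen CLIP image tower
        self.classifier = nn.Linear(embed_dim, num_classes)
        # Illustrative stand-in for the dense / semi-dense / sparse schemes
        self.stride = {"dense": 1, "semi-dense": 2, "sparse": 4}[scheme]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        segments = frames[:: self.stride]               # (S, C, H, W)
        feats = self.visual_encoder(segments)           # (S, embed_dim)
        logits = self.classifier(feats)                 # (S, num_classes)
        return logits.mean(dim=0)                       # aggregated behavior scores
```

A frozen CLIP image tower (e.g., from the open_clip package) producing 512-dimensional embeddings could be passed as visual_encoder, with num_classes set to the number of behavior labels in the target dataset.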
Related papers
- GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding [2.79453284883108]
This study evaluates the visual perception capabilities of multimodal large language models in animal activity recognition.
We found that while current multimodal LLMs still require improvement in semantic correspondence and time perception, they already demonstrate initial visual perception capabilities for animal activity recognition.
arXiv Detail & Related papers (2024-06-14T07:30:26Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
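As a generic illustration (not this paper's specific mechanism), a lightweight temporal module can be placed on top of per-frame CLIP embeddings and the pooled video feature matched against class text embeddings; all names and shapes below are assumptions.

```python
# Generic temporal head over per-frame CLIP embeddings (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalHead(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame CLIP image embeddings
        video_feat = self.temporal(frame_feats).mean(dim=1)   # (B, D)
        return F.normalize(video_feat, dim=-1)


def video_text_logits(video_feat: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits against normalized class text embeddings (K, D)."""
    return video_feat @ F.normalize(text_feats, dim=-1).t()
```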
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- CNN-Based Action Recognition and Pose Estimation for Classifying Animal Behavior from Videos: A Survey [0.0]
Action recognition, classifying activities performed by one or more subjects in a trimmed video, forms the basis of many techniques.
Deep learning models for human action recognition have progressed over the last decade.
Interest in research that applies deep learning-based action recognition to animal behavior classification has increased recently.
arXiv Detail & Related papers (2023-01-15T20:54:44Z)
- In-situ animal behavior classification using knowledge distillation and fixed-point quantization [6.649514998517633]
We take a deep and complex convolutional neural network, known as a residual neural network (ResNet), as the teacher model.
We implement both unquantized and quantized versions of the developed KD-based models on the embedded systems of our purpose-built collar and ear tag devices.
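A minimal sketch of soft-label knowledge distillation, under common assumptions (temperature-scaled KL divergence blended with cross-entropy); the paper's exact distillation recipe and fixed-point quantization pipeline may differ.

```python
# Standard soft-label KD loss (illustrative; not the paper's exact recipe).
import torch
import torch.nn.functional as F


def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            labels: torch.Tensor, temperature: float = 4.0, alpha: float = 0.7) -> torch.Tensor:
    """Blend the teacher's softened predictions with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

The distilled student would then typically be quantized (e.g., post-training quantization to 8-bit fixed point) before deployment on the embedded collar and ear tag hardware.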
arXiv Detail & Related papers (2022-09-09T06:07:17Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
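A hedged sketch of one such pretext task, assuming segment-level permutation of a skeleton sequence; the segment count and tensor layout are illustrative choices, not the paper's.

```python
# Segment-permutation pretext task: shuffle temporal segments and predict the permutation.
import itertools
import random
import torch

PERMS = list(itertools.permutations(range(3)))  # 3 segments -> 6 permutation classes


def permute_segments(sequence: torch.Tensor) -> tuple[torch.Tensor, int]:
    """sequence: (T, J, C) skeleton sequence; returns shuffled sequence and label."""
    segments = torch.chunk(sequence, 3, dim=0)
    label = random.randrange(len(PERMS))
    shuffled = torch.cat([segments[i] for i in PERMS[label]], dim=0)
    return shuffled, label
```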
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose [70.59906971581192]
We introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose effectively.
CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training.
Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings.
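The summary suggests a contrastive alignment between text prompts and animal keypoints; a generic CLIP-style symmetric contrastive loss, with assumed shapes and names, might look like the following sketch.

```python
# Hedged sketch of prompt-to-keypoint contrastive alignment (shapes assumed).
import torch
import torch.nn.functional as F


def prompt_keypoint_loss(kpt_feats: torch.Tensor, prompt_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """kpt_feats, prompt_feats: (K, D) features for K keypoints."""
    kpt = F.normalize(kpt_feats, dim=-1)
    txt = F.normalize(prompt_feats, dim=-1)
    logits = kpt @ txt.t() / temperature                # (K, K) similarity matrix
    targets = torch.arange(kpt.shape[0], device=kpt.device)
    # Each keypoint feature should match its own prompt, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```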
arXiv Detail & Related papers (2022-06-23T14:51:42Z)
- SemiMultiPose: A Semi-supervised Multi-animal Pose Estimation Framework [10.523555645910255]
Multi-animal pose estimation is essential for studying animals' social behaviors in neuroscience and neuroethology.
We propose a novel semi-supervised architecture for multi-animal pose estimation, leveraging the pervasive structures in unlabeled frames in behavior videos.
The resulting algorithm provides superior multi-animal pose estimation results in three animal experiments.
arXiv Detail & Related papers (2022-04-14T16:06:55Z)
- SuperAnimal pretrained pose estimation models for behavioral analysis [42.206265576708255]
Quantification of behavior is critical in applications ranging from neuroscience to veterinary medicine and animal conservation efforts.
We present a series of technical innovations that enable a new method, collectively called SuperAnimal, to develop unified foundation models.
arXiv Detail & Related papers (2022-03-14T18:46:57Z)
- Transferring Dense Pose to Proximal Animal Classes [83.84439508978126]
We show that it is possible to transfer the knowledge existing in dense pose recognition for humans, as well as in more general object detectors and segmenters, to the problem of dense pose recognition in other classes.
We do this by establishing a DensePose model for the new animal which is also geometrically aligned to humans.
We also introduce two benchmark datasets labelled in the manner of DensePose for the class chimpanzee and use them to evaluate our approach.
arXiv Detail & Related papers (2020-02-28T21:43:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.