Related papers: Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

URL: http://arxiv.org/abs/2505.05681v1
Date: Thu, 08 May 2025 22:48:52 GMT
Title: Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Authors: Giulio Cesare Mastrocinque Santo, Patrícia Izar, Irene Delval, Victor de Napole Gregolin, Nina S. T. Hirata,
Abstract summary: Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild.<n>We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys.
Score: 0.2796197251957245
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers to retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-folded approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune a pre-trained Microsoft's X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in $Hits@5$ of $167\%$ for the 16 frames model and an uplift of $114\%$ for the 8 frame model on our domain data. Moreover, based on $NDCG@K$ results, our model is able to rank well most of the considered behaviors, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.

Related papers

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions. We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models. Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure. Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. Our experiments show that image-text models exhibit impressive performance on video AR, video RT and video MC. These findings shed a light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM. It is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model. Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model? We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.