Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
- URL: http://arxiv.org/abs/2510.19160v1
- Date: Wed, 22 Oct 2025 01:33:39 GMT
- Title: Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
- Authors: Paimon Goulart, Jordan Steinhauser, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis
- Abstract summary: This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse. We use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing.
- Score: 5.170961907232911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integration of diverse data will be a pivotal step towards improving scientific explorations in many disciplines. This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse existing in and engaging with its environment. Importantly, this model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few existing programs can produce with comparable accuracy and with minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them results in strong F1 scores across all behaviors, including rare classes like freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
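The paper does not include code, but the pipeline it describes (prompting Qwen2.5-VL on preprocessed frames, with an in-context labeled example) can be illustrated with a minimal sketch. The snippet below assumes the Hugging Face `Qwen/Qwen2.5-VL-7B-Instruct` checkpoint and the standard `transformers` / `qwen_vl_utils` interface; the behavior label set, prompt wording, and frame paths are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of per-clip mouse behavior
# classification with Qwen2.5-VL and one in-context labeled example.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL examples

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"          # assumed checkpoint size
BEHAVIORS = ["freezing", "fleeing", "grooming",   # hypothetical label set
             "rearing", "walking"]

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

def classify_clip(frame_paths, icl_example=None):
    """Return one behavior label for a clip given as a list of frame image paths."""
    content = []
    if icl_example is not None:
        # In-context learning: show a labeled example clip before the query clip.
        content += [
            {"type": "video", "video": icl_example["frames"], "fps": 1.0},
            {"type": "text", "text": f"Example: this clip shows '{icl_example['label']}'."},
        ]
    content += [
        {"type": "video", "video": frame_paths, "fps": 1.0},
        {"type": "text",
         "text": "Classify the mouse's behavior in this clip. "
                 f"Answer with exactly one of: {', '.join(BEHAVIORS)}."},
    ]
    messages = [{"role": "user", "content": content}]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    # Decode only the newly generated tokens.
    return processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True)[0].strip().lower()

# Applying classify_clip to consecutive time windows of one session yields a
# label per window, i.e. a behavioral vector over time for that subject/session.
```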
Related papers
- Rodent-Bench [14.876393544574688]
We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task.
arXiv Detail & Related papers (2026-02-20T15:14:38Z) - Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations [14.972702558607557]
We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets. We propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results.
arXiv Detail & Related papers (2025-10-20T08:58:23Z) - Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy [2.294223504228228]
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems. Inspired by the human ability to assimilate information through many senses, this method enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal language models (MLLMs) are highlighted in this overview.
arXiv Detail & Related papers (2024-12-23T18:15:19Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition [2.7532797256542403]
Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas. We introduce our comprehensive Fitness Multimodal Activity dataset (FiMAD) to enhance HAR performance across various modalities. We show that FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH.
arXiv Detail & Related papers (2024-06-06T08:42:36Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI yields remarkable improvements of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Coarse-to-Fine Knowledge-Enhanced Multi-Interest Learning Framework for Multi-Behavior Recommendation [52.89816309759537]
Multiple types of behaviors (e.g., clicking, adding to cart, purchasing) widely exist in most real-world recommendation scenarios.
The state-of-the-art multi-behavior models learn behavior dependencies indistinguishably with all historical interactions as input.
We propose a novel Coarse-to-fine Knowledge-enhanced Multi-interest Learning framework to learn shared and behavior-specific interests for different behaviors.
arXiv Detail & Related papers (2022-08-03T05:28:14Z) - An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions [39.265388879471686]
We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) dataset.
Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay.
The CalMS21 dataset is part of the Multi-Agent Behavior Challenge 2021; as a next step, we aim to incorporate datasets from other domains studying multi-agent behavior.
arXiv Detail & Related papers (2021-04-06T17:58:47Z) - Invariant Feature Learning for Sensor-based Human Activity Recognition [11.334750079923428]
We present an invariant feature learning framework (IFLF) that extracts common information shared across subjects and devices.
Experiments demonstrated that IFLF is effective in handling both subject and device diversion across popular open datasets and an in-house dataset.
arXiv Detail & Related papers (2020-12-14T21:56:17Z)