Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening
- URL: http://arxiv.org/abs/2602.13507v1
- Date: Fri, 13 Feb 2026 22:36:42 GMT
- Title: Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening
- Authors: Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader, Tariq Adnan, Fazla Rabbi Mashrur, Sooyong Park, Praveen Kumar, Qasim Sudais, Natalia Chunga, Nami Shah, Jan Freyberg, Christopher Kanan, Ruth Schneider, Ehsan Hoque
- Abstract summary: Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD). We evaluate seven state-of-the-art VFMs to determine their robustness in clinical screening.
- Score: 14.380171823525108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs -- including VideoPrism, V-JEPA, ViViT, and VideoMAE -- to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
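The abstract states only the high-level protocol: frozen per-video VFM embeddings are scored with a linear classification head, and screening quality is reported as AUC, accuracy, sensitivity, and specificity. Below is a minimal sketch of that kind of linear-probe evaluation; the embedding arrays, the train/test split, and the scikit-learn logistic-regression head are illustrative assumptions, not the authors' released code (see the repository linked above for the actual implementation).

```python
# Minimal linear-probe sketch: PD vs. non-PD classification from frozen
# video-foundation-model embeddings. Data layout is hypothetical; the
# paper's released code may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_linear_probe(train_emb, train_y, test_emb, test_y):
    """Fit a linear head on frozen embeddings and report screening metrics."""
    probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    probe.fit(train_emb, train_y)

    prob = probe.predict_proba(test_emb)[:, 1]  # P(PD) per video
    pred = (prob >= 0.5).astype(int)            # default threshold; task-aware
                                                # calibration would adjust this
    tn, fp, fn, tp = confusion_matrix(test_y, pred).ravel()
    return {
        "auc": roc_auc_score(test_y, prob),
        "accuracy": accuracy_score(test_y, pred),
        "sensitivity": tp / (tp + fn),          # recall on PD participants
        "specificity": tn / (tn + fp),          # recall on non-PD participants
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for pooled VFM features (e.g. 768-d per video) and labels.
    X_train, X_test = rng.normal(size=(800, 768)), rng.normal(size=(200, 768))
    y_train, y_test = rng.integers(0, 2, 800), rng.integers(0, 2, 200)
    print(evaluate_linear_probe(X_train, y_train, X_test, y_test))
```

In a setup like this, the sensitivity/specificity gap the abstract reports would typically be addressed by moving the 0.5 decision threshold or calibrating per task, which is consistent with the "task-aware calibration" the authors call for.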
Related papers
- UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos [81.9180187964947]
We present UniSurg, a foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. To enable large-scale pretraining, we curate the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
arXiv Detail & Related papers (2026-02-05T13:18:33Z) - NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification [56.133469598652624]
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. The Neurosurgical Anatomy Benchmark (NeuroABench) is the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures.
arXiv Detail & Related papers (2025-12-07T17:00:25Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - A versatile foundation model for cine cardiac magnetic resonance image analysis tasks [6.488550274514015]
We present a versatile foundation model that can perform a range of clinically relevant image analysis tasks. A multi-view convolution-transformer masked autoencoder, named CineMA, was trained on 15 million cine images from 74,916 subjects.
arXiv Detail & Related papers (2025-05-31T19:12:34Z) - MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer [6.520396145278936]
We introduce a visual query-based video clip localization (VQ) method to assist sonographers by enabling them to capture a quick US sweep. MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens.
arXiv Detail & Related papers (2025-04-08T14:29:15Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis [3.1851272788128644]
Existing AI-based Parkinson's Disease detection methods primarily focus on unimodal analysis of motor or speech tasks. We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages multimodal data to enhance diagnostic accuracy. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity.
arXiv Detail & Related papers (2024-06-21T04:02:19Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Automated interpretation of congenital heart disease from multi-view echocardiograms [10.238433789459624]
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China.
This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework.
arXiv Detail & Related papers (2023-11-30T18:37:21Z) - Exploring Motion Boundaries in an End-to-End Network for Vision-based Parkinson's Severity Assessment [2.359557447960552]
We present an end-to-end deep learning framework to measure Parkinson's disease severity in two important components, hand movement and gait.
Our method leverages an Inflated 3D CNN trained with a temporal segment framework to learn spatial and long-range temporal structure in video data.
We evaluate our proposed method on a dataset of 25 PD patients, obtaining 72.3% and 77.1% top-1 accuracy on hand movement and gait tasks respectively.
arXiv Detail & Related papers (2020-12-17T19:20:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.