Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study
- URL: http://arxiv.org/abs/2505.01680v1
- Date: Sat, 03 May 2025 04:00:51 GMT
- Title: Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study
- Authors: Tamim Ahmed, Thanassis Rikakis,
- Abstract summary: Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable.<n>We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations.<n>This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
- Score: 1.0463644684200606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
Related papers
- Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening [0.7136933021609076]
This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment.<n>The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness.
arXiv Detail & Related papers (2026-01-17T03:35:39Z) - VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.<n>Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence.<n>We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.<n>Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality.<n>The dataset contains near 10000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels.<n>We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z) - STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery [41.140934816875806]
We introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks.<n>StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes.<n>We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines.
arXiv Detail & Related papers (2025-09-02T18:48:37Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low.<n>Inference-time alignment relies on three components: sampling, evaluation, and feedback.<n>We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models.<n>We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions.<n>We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z) - Efficient Frame Extraction: A Novel Approach Through Frame Similarity and Surgical Tool Tracking for Video Segmentation [0.0]
We propose a technique that can efficiently eliminate redundant frames to reduce dataset size and computation time.<n>Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools.<n>We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases.
arXiv Detail & Related papers (2025-01-19T19:36:09Z) - Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin Representation [13.388576093178887]
We present a DT representation-based framework for surgical phase recognition from videos.<n>The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution and corrupted test samples.<n>Our findings lend support to the thesis that DT representations are effective in enhancing model robustness.
arXiv Detail & Related papers (2024-10-26T00:49:06Z) - AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - MR-STGN: Multi-Residual Spatio Temporal Graph Network Using Attention Fusion for Patient Action Assessment [0.30693357740321775]
We propose an automated approach for patient action assessment using a Multi-Residual Spatio Temporal Graph Network (MR-STGN)<n>The MR-STGN is specifically designed to capture the dynamics of patient actions.<n>We evaluate our model on the UI-PRMD dataset demonstrating its performance in accurately predicting real-time patient action scores.
arXiv Detail & Related papers (2023-12-21T01:09:52Z) - D-STGCNT: A Dense Spatio-Temporal Graph Conv-GRU Network based on transformer for assessment of patient physical rehabilitation [0.30693357740321775]
This paper introduces a new graph-based model for assessing rehabilitation exercises.<n>Dense connections and GRU mechanisms are used to rapidly process large 3D skeleton inputs.<n>The evaluation of our proposed approach on the KIMORE and UI-PRMD datasets highlighted its potential.
arXiv Detail & Related papers (2023-12-21T00:38:31Z) - Coordinate Transformer: Achieving Single-stage Multi-person Mesh
Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond.
We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner.
Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z) - Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review
and Replicability Study [60.56194508762205]
We reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models.
We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation.
We present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models.
arXiv Detail & Related papers (2023-04-21T11:54:44Z) - Tele-EvalNet: A Low-cost, Teleconsultation System for Home based
Rehabilitation of Stroke Survivors using Multiscale CNN-LSTM Architecture [7.971065005161566]
We propose Tele-EvalNet, a novel system consisting of two components: a live feedback model and an overall performance evaluation model.
The live feedback model demonstrates feedback on exercise correctness with easy to understand instructions highlighted using color markers.
The overall performance evaluation model learns a mapping of joint data to scores, given to the performance by clinicians.
arXiv Detail & Related papers (2021-12-06T16:58:00Z) - Assessing YOLACT++ for real time and robust instance segmentation of
medical instruments in endoscopic procedures [0.5735035463793008]
Image-based tracking of laparoscopic instruments plays a fundamental role in computer and robotic-assisted surgeries.
To date, most of the existing models for instance segmentation of medical instruments were based on two-stage detectors.
We propose the addition of attention mechanisms to the YOLACT architecture that allows real-time instance segmentation of instruments.
arXiv Detail & Related papers (2021-03-30T00:09:55Z) - One-shot action recognition towards novel assistive therapies [63.23654147345168]
This work is motivated by the automated analysis of medical therapies that involve action imitation games.
The presented approach incorporates a pre-processing step that standardizes heterogeneous motion data conditions.
We evaluate the approach on a real use-case of automated video analysis for therapy support with autistic people.
arXiv Detail & Related papers (2021-02-17T19:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.