Advanced Gesture Recognition for Autism Spectrum Disorder Detection: Integrating YOLOv7, Video Augmentation, and VideoMAE for Naturalistic Video Analysis
- URL: http://arxiv.org/abs/2410.09339v3
- Date: Sun, 17 Aug 2025 20:53:31 GMT
- Title: Advanced Gesture Recognition for Autism Spectrum Disorder Detection: Integrating YOLOv7, Video Augmentation, and VideoMAE for Naturalistic Video Analysis
- Authors: Amit Kumar Singh, Vrijendra Singh
- Abstract summary: Repetitive motor behaviors such as spinning, head banging, and arm flapping are key indicators for diagnosis of autism spectrum disorder (ASD). This study focuses on distinguishing between children with ASD and typically developed (TD) peers by analyzing videos captured in natural, uncontrolled environments. We adopt a pipeline integrating YOLOv7-based detection, extensive video augmentations, and the VideoMAE framework, which efficiently captures both spatial and temporal features through a high-ratio masking and reconstruction strategy.
- Score: 10.298059998417104
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deep learning and contactless sensing technologies have significantly advanced the automated assessment of human behaviors in healthcare. In the context of autism spectrum disorder (ASD), repetitive motor behaviors such as spinning, head banging, and arm flapping are key indicators for diagnosis. This study focuses on distinguishing between children with ASD and typically developed (TD) peers by analyzing videos captured in natural, uncontrolled environments. Using the publicly available Self-Stimulatory Behavior Dataset (SSBD), we address the classification task as a binary problem, ASD vs. TD, based on stereotypical repetitive gestures. We adopt a pipeline integrating YOLOv7-based detection, extensive video augmentations, and the VideoMAE framework, which efficiently captures both spatial and temporal features through a high-ratio masking and reconstruction strategy. Our proposed approach achieves 95% accuracy, 0.93 precision, 0.94 recall, and 0.94 F1 score, surpassing the previous state-of-the-art by a significant margin. These results demonstrate the effectiveness of combining advanced object detection, robust data augmentation, and masked autoencoder-based video modeling for reliable ASD vs. TD classification in naturalistic settings.
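The "high-ratio masking and reconstruction strategy" the abstract attributes to VideoMAE can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' code: it shows tube masking, where the same spatial patches are hidden in every frame at a high ratio (around 90%), so the encoder only sees a small fraction of the video tokens. The grid size, frame count, and ratio below are illustrative defaults.

```python
import numpy as np

def tube_mask(num_frames=16, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0):
    """Generate a VideoMAE-style 'tube' mask: the same spatial patch
    positions are masked in every frame, at a high ratio (e.g. 90%)."""
    rng = np.random.default_rng(seed)
    num_patches = grid_h * grid_w               # patches per frame
    num_masked = int(num_patches * mask_ratio)  # e.g. 176 of 196
    # Choose which spatial positions to mask, once for all frames.
    masked_positions = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_positions] = True
    # Repeat the same spatial mask along the temporal axis ("tubes").
    return np.tile(frame_mask, (num_frames, 1))  # shape (T, H*W)

mask = tube_mask()
visible_per_frame = (~mask[0]).sum()  # patches the encoder actually sees
```

Because the mask forms temporal tubes, the model cannot reconstruct a hidden patch by copying it from a neighboring frame, which is what forces it to learn spatio-temporal structure rather than near-duplicate frame content.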
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z) - Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis [2.481802259298367]
This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity.
arXiv Detail & Related papers (2025-06-07T18:27:24Z) - Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction [142.66410908560582]
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. We propose Dynamic Pose Interaction Diffusion Models (DPIDM) to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency.
arXiv Detail & Related papers (2025-05-22T17:52:34Z) - A Novel Dataset for Video-Based Neurodivergent Classification Leveraging Extra-Stimulatory Behavior [15.88235629194724]
We introduce the Video ASD dataset, a dataset that contains video frame convolutional and attention map feature data. Results show that our model effectively generalizes and understands key differences in the distinct movements of the children.
arXiv Detail & Related papers (2024-09-06T20:11:02Z) - Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder [3.6630139570443996]
We provide a dataset for training computer vision models to detect Autism Spectrum Disorder (ASD)-related phenotypic markers.
We trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%.
arXiv Detail & Related papers (2024-08-23T17:55:58Z) - Video-Based Autism Detection with Deep Learning [0.0]
We develop a deep learning model that analyzes video clips of children reacting to sensory stimuli.
Results show that our model effectively generalizes and understands key differences in the distinct movements of the children.
arXiv Detail & Related papers (2024-02-26T17:45:00Z) - Introducing SSBD+ Dataset with a Convolutional Pipeline for detecting Self-Stimulatory Behaviours in Children using raw videos [1.1874952582465603]
The authors propose a novel pipelined deep learning architecture to detect certain self-stimulatory behaviors that help in the diagnosis of autism spectrum disorder (ASD).
An overall accuracy of around 81% was achieved from the proposed pipeline model that is targeted for real-time and hands-free automated diagnosis.
arXiv Detail & Related papers (2023-11-25T16:57:24Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Language-Assisted Deep Learning for Autistic Behaviors Recognition [13.200025637384897]
We show that a vision-based problem behaviors recognition system can achieve high accuracy and outperform the previous methods by a large margin.
We propose a two-branch multimodal deep learning framework by incorporating the "freely available" language description for each type of problem behavior.
Experimental results demonstrate that incorporating additional language supervision can bring an obvious performance boost for the autism problem behaviors recognition task.
arXiv Detail & Related papers (2022-11-17T02:58:55Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Vision-Based Activity Recognition in Children with Autism-Related Behaviors [15.915410623440874]
We demonstrate the effectiveness of a region-based computer vision system to help clinicians and parents analyze a child's behavior.
The data is pre-processed by detecting the target child in the video to reduce the impact of background noise.
Motivated by the effectiveness of temporal convolutional models, we propose both light-weight and conventional models capable of extracting action features from video frames.
arXiv Detail & Related papers (2022-08-08T15:12:27Z) - Context-Aware Sequence Alignment using 4D Skeletal Augmentation [67.05537307224525]
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
We propose a novel context-aware self-supervised learning architecture to align sequences of actions.
Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions.
arXiv Detail & Related papers (2022-04-26T10:59:29Z) - Neurosymbolic hybrid approach to driver collision warning [64.02492460600905]
There are two main algorithmic approaches to autonomous driving systems.
Deep learning alone has achieved state-of-the-art results in many areas.
But deep learning models can be very difficult to debug when they fail.
arXiv Detail & Related papers (2022-03-28T20:29:50Z) - SOUL: An Energy-Efficient Unsupervised Online Learning Seizure Detection Classifier [68.8204255655161]
Implantable devices that record neural activity and detect seizures have been adopted to issue warnings or trigger neurostimulation to suppress seizures.
For an implantable seizure detection system, a low power, at-the-edge, online learning algorithm can be employed to dynamically adapt to neural signal drifts.
SOUL was fabricated in TSMC's 28 nm process occupying 0.1 mm² and achieves 1.5 nJ/classification energy efficiency, which is at least 24x more efficient than state-of-the-art.
arXiv Detail & Related papers (2021-10-01T23:01:20Z) - A Spatio-temporal Attention-based Model for Infant Movement Assessment from Videos [44.71923220732036]
We develop a new method for fidgety movement assessment using human poses extracted from short clips.
Human poses capture only relevant motion profiles of joints and limbs and are free from irrelevant appearance artifacts.
Our experiments show that the proposed method achieves the ROC-AUC score of 81.87%, significantly outperforming existing competing methods with better interpretability.
arXiv Detail & Related papers (2021-05-20T14:31:54Z) - Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory units, and inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z) - Multi-view Mouse Social Behaviour Recognition with Deep Graphical Model [124.26611454540813]
Social behaviour analysis of mice is an invaluable tool to assess therapeutic efficacy of neurodegenerative diseases.
Because of the potential to create rich descriptions of mouse social behaviors, the use of multi-view video recordings for rodent observations is receiving increasing attention.
We propose a novel multiview latent-attention and dynamic discriminative model that jointly learns view-specific and view-shared sub-structures.
arXiv Detail & Related papers (2020-11-04T18:09:58Z) - Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - Detecting Parkinsonian Tremor from IMU Data Collected In-The-Wild using Deep Multiple-Instance Learning [59.74684475991192]
Parkinson's Disease (PD) is a slowly evolving neurological disease that affects about 1% of the population above 60 years old.
PD symptoms include tremor, rigidity and bradykinesia.
We present a method for automatically identifying tremorous episodes related to PD, based on IMU signals captured via a smartphone device.
arXiv Detail & Related papers (2020-05-06T09:02:30Z) - 4D Spatio-Temporal Deep Learning with 4D fMRI Data for Autism Spectrum Disorder Classification [69.62333053044712]
We propose a 4D convolutional deep learning approach for ASD classification where we jointly learn from spatial and temporal data.
We employ 4D neural networks and convolutional-recurrent models which outperform a previous approach with an F1-score of 0.71 compared to an F1-score of 0.65.
arXiv Detail & Related papers (2020-04-21T17:19:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.