Arabic Sign Language Recognition using Multimodal Approach
- URL: http://arxiv.org/abs/2601.17041v1
- Date: Tue, 20 Jan 2026 09:21:43 GMT
- Title: Arabic Sign Language Recognition using Multimodal Approach
- Authors: Ghadeer Alanazi, Abir Benabid
- Abstract summary: Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. Existing recognition systems face significant challenges due to their reliance on single-sensor approaches such as Leap Motion or RGB cameras. This paper investigates a multimodal approach that combines Leap Motion and RGB camera data to explore the feasibility of ArSL recognition.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. However, existing recognition systems face significant challenges due to their reliance on single-sensor approaches such as Leap Motion or RGB cameras. These systems struggle with limitations such as inadequate tracking of complex hand orientations and imprecise recognition of 3D hand movements. This paper investigates a multimodal approach that combines Leap Motion and RGB camera data to assess the feasibility of ArSL recognition. The system architecture includes two parallel subnetworks: a custom dense neural network for Leap Motion data, incorporating dropout and L2 regularization, and an image subnetwork based on a fine-tuned VGG16 model enhanced with data augmentation techniques. Feature representations from both modalities are concatenated in a fusion model and passed through fully connected layers, with final classification performed via softmax activation to analyze spatial and temporal features of hand gestures. The system was evaluated on a custom dataset comprising 18 ArSL words, of which 13 were correctly recognized, yielding an overall accuracy of 78%. These results offer preliminary insights into the viability of multimodal fusion for sign language recognition and highlight areas for further optimization and dataset expansion.
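The architecture described in the abstract maps naturally onto a small Keras model. The sketch below is a hypothetical reconstruction from the abstract alone: the layer widths, dropout rate, L2 coefficient, and Leap Motion feature dimension (here assumed to be 21 tracked joints x 3 coordinates) are illustrative guesses, not values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers

NUM_CLASSES = 18    # 18 ArSL words, per the abstract
LEAP_FEATURES = 63  # assumption: 21 Leap Motion joints x 3D coordinates

# Leap Motion branch: custom dense network with dropout and L2 regularization.
leap_in = layers.Input(shape=(LEAP_FEATURES,), name="leap")
x = layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(leap_in)
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)

# Image branch: VGG16 pretrained on ImageNet, fine-tuned on the sign images.
img_in = layers.Input(shape=(224, 224, 3), name="rgb")
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                  input_tensor=img_in)
y = layers.GlobalAveragePooling2D()(vgg.output)
y = layers.Dense(64, activation="relu")(y)

# Fusion: concatenate both feature vectors, then fully connected + softmax.
z = layers.Concatenate()([x, y])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(NUM_CLASSES, activation="softmax")(z)

model = Model(inputs=[leap_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The data augmentation the abstract mentions for the image branch could be added with preprocessing layers such as tf.keras.layers.RandomFlip and RandomRotation ahead of the VGG16 stem; freezing the VGG16 convolutional blocks for the first training epochs is a common fine-tuning choice, though the paper's exact schedule is not stated.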
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition (IR) with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z) - Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition [21.772052273755808]
We present a Language-guided and Motion-aware gait recognition framework, named LMGait. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. We conduct extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network.
arXiv Detail & Related papers (2026-01-17T06:33:05Z) - Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z) - FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition [1.3499500088995462]
FusionEnsemble-Net is an attention-based ensemble of networks that fuses visual and motion data to enhance recognition accuracy (a generic attention-fusion sketch follows this list). Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44%.
arXiv Detail & Related papers (2025-08-12T21:44:23Z) - RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning [15.670921552151775]
RingMo-Agent is designed to handle multi-modal and multi-platform data. It is supported by a large-scale vision-language dataset named RS-VL3M. It proves effective in both visual understanding and sophisticated analytical tasks.
arXiv Detail & Related papers (2025-07-28T12:39:33Z) - EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM [103.7537991413311]
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z) - HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning [14.038083767470019]
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities. We show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.
arXiv Detail & Related papers (2025-05-23T09:06:09Z) - GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
We propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the Multimodal Gaussian Splatting. Our method can effectively leverage the complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability.
arXiv Detail & Related papers (2024-10-01T00:43:45Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections; a sketch of the 3D-CDC operator follows this list.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
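The 3D-CDC family named in the last entry builds on central difference convolution, where the response additionally subtracts a theta-weighted copy of the center value times the kernel sum; algebraically this equals a vanilla 3D convolution minus theta times a 1x1x1 convolution whose kernel is the tap-sum of the original kernel. The sketch below shows only the full spatio-temporal variant (the paper's family also includes temporal-only variants); the kernel size and the theta=0.7 default are common choices assumed here, not taken from the entry.

```python
import tensorflow as tf

class CDC3D(tf.keras.layers.Layer):
    """Spatio-temporal 3D central difference convolution (a sketch).

    y = conv3d(x, W) - theta * conv3d(x, sum(W)), where sum(W) collapses
    the kernel over its taps to 1x1x1; theta = 0 recovers a plain Conv3D.
    """
    def __init__(self, filters, kernel_size=3, theta=0.7, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv3D(filters, kernel_size,
                                           padding="same", use_bias=False)
        self.theta = theta

    def call(self, x):
        vanilla = self.conv(x)  # also builds the kernel on first call
        # Collapse the kernel over its (t, h, w) taps to a 1x1x1 kernel,
        # implementing the "- theta * x(p0) * sum(w)" difference term.
        k = tf.reduce_sum(self.conv.kernel, axis=[0, 1, 2], keepdims=True)
        diff = tf.nn.conv3d(x, k, strides=[1] * 5, padding="SAME")
        return vanilla - self.theta * diff
```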
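Similarly, the FusionEnsemble-Net entry above gives no detail beyond "attention-based ensemble", so the following is only a generic sketch of one common way to weight two modality streams with learned softmax attention; the class name and dimensions are illustrative, not the paper's design, and both inputs are assumed to share the same feature width.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ModalityAttentionFusion(layers.Layer):
    """Attention-weighted sum of two same-width modality embeddings.

    A generic fusion pattern, not FusionEnsemble-Net's actual design.
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1)  # one scalar score per modality

    def call(self, visual_feat, motion_feat):
        stacked = tf.stack([visual_feat, motion_feat], axis=1)  # (B, 2, D)
        weights = tf.nn.softmax(self.score(stacked), axis=1)    # (B, 2, 1)
        return tf.reduce_sum(weights * stacked, axis=1)         # (B, D)
```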