Related papers: Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model

URL: http://arxiv.org/abs/2408.14111v1
Date: Mon, 26 Aug 2024 08:55:16 GMT
Title: Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model
Authors: Abu Saleh Musa Miah, Md. Al Mehedi Hasan, Md Hadiuzzaman, Muhammad Nazrul Islam, Jungpil Shin
Abstract summary: We propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images. Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space.
Score: 0.5825410941577593
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hand gesture-based sign language recognition (SLR) is one of the most advanced applications of machine learning, and computer vision uses hand gestures. Although, in the past few years, many researchers have widely explored and studied how to address BSL problems, specific unaddressed issues remain, such as skeleton and transformer-based BSL recognition. In addition, the lack of evaluation of the BSL model in various concealed environmental conditions can prove the generalized property of the existing model by facing daily life signs. As a consequence, existing BSL recognition systems provide a limited perspective of their generalisation ability as they are tested on datasets containing few BSL alphabets that have a wide disparity in gestures and are easy to differentiate. To overcome these limitations, we propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images. The main aim of utilising hand skeleton-based BSL data is to ensure the privacy and low-resolution sequence of images, which need minimum computational cost and low hardware configurations. Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space. Specifically, the use of Separable TCN combined with a powerful multi-head spatial-temporal attention architecture generated high-performance accuracy. The extensive experiments with a proposed dataset and two benchmark BSL datasets with a wide range of evaluations, such as intra- and inter-dataset evaluation settings, demonstrated that our proposed models achieve competitive performance with extremely low computational complexity and run faster than existing models.

Related papers

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model [56.573203512455706]
Large-scale vision-language models (VLMs) have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. One approach to address this issue is to develop interpretable models by integrating language. We propose LaZSL, a locally-aligned vision-language model for interpretable ZSL.
arXiv Detail & Related papers (2025-06-30T13:14:46Z)
How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? [9.094835948226063]
Gestures enable non-verbal human-robot communication in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based
arXiv Detail & Related papers (2025-06-25T19:36:45Z)
Bridging Brain with Foundation Models through Self-Supervised Learning [5.0273296425814635]
Foundation models (FMs) have redefined the capabilities of artificial intelligence. These advances present a transformative opportunity for brain signal analysis. This survey systematically reviews the emerging field of bridging brain signals with foundation models.
arXiv Detail & Related papers (2025-06-19T04:03:58Z)
Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition [1.949837893170278]
Hand gesture-based Sign Language Recognition serves as a crucial bridge between deaf and non-deaf individuals. We propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture. We validated our model through extensive experiments on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-21T04:57:18Z)
New keypoint-based approach for recognising British Sign Language (BSL) from sequences [53.397276621815614]
We present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model's performance is assessed using the BOBSL dataset, revealing that the keypoint-based approach surpasses its RGB-based counterpart in computational efficiency and memory usage.
arXiv Detail & Related papers (2024-12-12T17:20:27Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
Explaining, Analyzing, and Probing Representations of Self-Supervised Learning Models for Sensor-based Human Activity Recognition [2.2082422928825136]
Self-supervised learning (SSL) frameworks have been extensively applied to sensor-based Human Activity Recognition (HAR). In this paper, we aim to analyze deep representations of two recent SSL frameworks, namely SimCLR and VICReg.
arXiv Detail & Related papers (2023-04-14T07:53:59Z)
Self-supervised Learning for Clustering of Wireless Spectrum Activity [0.16777183511743468]
We investigate the use of self-supervised learning (SSL) for exploring spectrum activities in real-world unlabeled data. We show that SSL models achieve superior performance regarding the quality of extracted features and clustering performance.
arXiv Detail & Related papers (2022-09-22T11:19:49Z)
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation [27.857955394020475]
Self-Supervised Learning (SSL) models have been successfully applied in various deep learning-based speech tasks. The quality of SSL representations depends highly on the relatedness between the SSL training domain(s) and the target data domain. We propose a learnable and interpretable framework to combine SF and SSL representations.
arXiv Detail & Related papers (2022-04-05T20:09:15Z)
Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate. We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR). Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model. A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction. A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
Self-Supervised Learning of Graph Neural Networks: A Unified Review [50.71341657322391]
Self-supervised learning is emerging as a new paradigm for making use of large amounts of unlabeled samples. We provide a unified review of different ways of training graph neural networks (GNNs) using SSL. Our treatment of SSL methods for GNNs sheds light on the similarities and differences of various methods, setting the stage for developing new methods and algorithms.
arXiv Detail & Related papers (2021-02-22T03:43:45Z)
On Data-Augmentation and Consistency-Based Semi-Supervised Learning [77.57285768500225]
Recently proposed consistency-based Semi-Supervised Learning (SSL) methods have advanced the state of the art in several SSL tasks. Despite these advances, the understanding of these methods is still relatively limited.
arXiv Detail & Related papers (2021-01-18T10:12:31Z)
Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language. Recognizing signs from videos is a challenging task as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements. Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information. We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.