SignBart -- New approach with the skeleton sequence for Isolated Sign language Recognition
- URL: http://arxiv.org/abs/2506.21592v1
- Date: Wed, 18 Jun 2025 07:07:36 GMT
- Title: SignBart -- New approach with the skeleton sequence for Isolated Sign language Recognition
- Authors: Tinh Nguyen, Minh Khue Phan Tran
- Abstract summary: This study presents a novel SLR approach that overcomes the challenge of independently extracting meaningful information from the x and y coordinates of skeleton sequences. With only 749,888 parameters, the model achieves 96.04% accuracy on the LSA-64 dataset. The model also demonstrates excellent performance and generalization across the WLASL and ASL-Citizen datasets.
- Score: 0.17578923069457017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language recognition is crucial for individuals with hearing impairments to break communication barriers. However, previous approaches have had to choose between efficiency and accuracy: models such as RNNs, LSTMs, and GCNs suffered from vanishing gradients and high computational costs, while transformer-based methods, despite improving performance, were not commonly used. This study presents a novel SLR approach that overcomes the challenge of independently extracting meaningful information from the x and y coordinates of skeleton sequences, which traditional models often treat as inseparable. By utilizing the encoder-decoder of the BART architecture, the model independently encodes the x and y coordinates, while cross-attention ensures their interrelation is maintained. With only 749,888 parameters, the model achieves 96.04% accuracy on the LSA-64 dataset, significantly outperforming previous models with over one million parameters. The model also demonstrates excellent performance and generalization across the WLASL and ASL-Citizen datasets. Ablation studies underscore the importance of coordinate projection, normalization, and the use of multiple skeleton components in boosting model efficacy. This study offers a reliable and effective approach for sign language recognition, with strong potential for enhancing accessibility tools for the deaf and hard of hearing.
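As a rough, hypothetical illustration of the architecture the abstract describes (independent encoding of the x and y coordinate streams, related through cross-attention), the PyTorch sketch below approximates the BART-style encoder-decoder split with two encoders plus one cross-attention block; the joint count, layer sizes, and the pooling/classifier head are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SignBartSketch(nn.Module):
    """Hypothetical sketch: encode x and y skeleton coordinates
    separately, then fuse the two streams with cross-attention."""

    def __init__(self, num_joints=75, d_model=128, num_classes=64):
        super().__init__()
        # Project each coordinate axis independently.
        self.proj_x = nn.Linear(num_joints, d_model)
        self.proj_y = nn.Linear(num_joints, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.enc_x = nn.TransformerEncoder(layer, num_layers=2)
        self.enc_y = nn.TransformerEncoder(layer, num_layers=2)
        # Cross-attention: the x-stream queries attend to the y-stream,
        # so the interrelation between the axes is preserved.
        self.cross = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x_coords, y_coords):
        # x_coords, y_coords: (batch, frames, num_joints)
        hx = self.enc_x(self.proj_x(x_coords))
        hy = self.enc_y(self.proj_y(y_coords))
        fused, _ = self.cross(hx, hy, hy)      # query=x, key/value=y
        return self.head(fused.mean(dim=1))    # average over time

logits = SignBartSketch()(torch.randn(2, 60, 75), torch.randn(2, 60, 75))
```

The separate per-axis projections correspond to the coordinate-projection step that the abstract's ablation studies highlight.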
Related papers
- Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings [0.0]
Arabic dialect recognition presents a significant challenge due to the linguistic diversity of Arabic and the scarcity of large annotated datasets. This research investigates hybrid modeling strategies that integrate classical signal processing techniques with deep learning architectures.
arXiv Detail & Related papers (2025-06-26T15:36:25Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
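A minimal sketch of the general mechanism behind model-free n-gram drafting: cheap n-gram statistics over already-seen text propose several tokens, which the target model then verifies in one batched forward pass. STAND's actual table construction, stochastic drafting policy, and verification step differ; the verifier is omitted here.

```python
from collections import defaultdict
import random

class NGramDrafter:
    """Toy model-free drafter: remember observed bigram continuations
    and propose short stochastic drafts from them."""

    def __init__(self):
        self.table = defaultdict(list)  # token -> observed next tokens

    def observe(self, tokens):
        for prev, nxt in zip(tokens, tokens[1:]):
            self.table[prev].append(nxt)

    def draft(self, last_token, k=4):
        """Propose up to k tokens by sampling observed continuations."""
        out, cur = [], last_token
        for _ in range(k):
            cands = self.table.get(cur)
            if not cands:
                break
            cur = random.choice(cands)  # stochastic draft step
            out.append(cur)
        return out

# Drafts are cheap guesses; the target LM would verify them and accept
# the longest matching prefix, skipping that many autoregressive steps.
drafter = NGramDrafter()
drafter.observe([1, 2, 3, 2, 3, 4])
print(drafter.draft(2))
```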
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- Robust Persian Digit Recognition in Noisy Environments Using Hybrid CNN-BiGRU Model [1.5566524830295307]
This study addresses isolated spoken Persian digit recognition (zero to nine) under noisy conditions. A hybrid model combining residual convolutional neural networks and bidirectional gated recurrent units (BiGRU) is proposed. Experimental results demonstrate the model's effectiveness, achieving 98.53%, 96.10%, and 95.92% accuracy on the training, validation, and test sets.
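A minimal sketch of a hybrid CNN + BiGRU classifier of the kind described; the paper uses residual convolutions, which this simplified version omits, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """Sketch: CNN front end over spectrograms, BiGRU over time,
    linear head for 10 digit classes (dimensions illustrative)."""

    def __init__(self, n_mels=40, hidden=128, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # pool frequency, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.bigru = nn.GRU(64 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        f = self.conv(spec)                   # (batch, 64, n_mels//4, time)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)
        out, _ = self.bigru(f)
        return self.fc(out[:, -1])            # last step -> class logits

logits = CNNBiGRU()(torch.randn(2, 1, 40, 120))
```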
arXiv Detail & Related papers (2024-12-14T15:11:42Z)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
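A schematic of the rollout-sampling idea: score each position in a reasoning trace by how often fresh continuations from that point still reach a correct answer. `sample_continuation` and `is_correct` are hypothetical stand-ins for an LLM call and an answer checker, and the paper's token-level contrastive estimation is not reproduced.

```python
def criticality_scores(trace_tokens, sample_continuation, is_correct,
                       n_rollouts=16):
    """For each position i, resample continuations from the prefix
    ending just before token i and record the rollout success rate.
    A sharp drop after position i suggests token i steers the
    trajectory toward a wrong answer."""
    scores = []
    for i in range(len(trace_tokens)):
        prefix = trace_tokens[:i]
        successes = sum(
            is_correct(prefix + sample_continuation(prefix))
            for _ in range(n_rollouts)
        )
        scores.append(successes / n_rollouts)
    return scores
```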
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
- Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature [1.7779568951268254]
We introduce a novel methodology for voice pathology detection using the publicly available Saarbrücken Voice Database. We evaluate six machine learning (ML) algorithms: support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest, and AdaBoost. Our approach achieves 85.61%, 84.69%, and 85.22% unweighted average recall (UAR) for females, males, and combined results, respectively.
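A sketch of the evaluation setup named above: the six classical classifiers scored by unweighted average recall (UAR), which scikit-learn exposes as balanced accuracy. Features and labels are random placeholders, not the Saarbrücken data or the paper's pitch-difference feature.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Placeholder features and binary pathology labels.
X, y = np.random.rand(200, 12), np.random.randint(0, 2, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "SVM": SVC(), "kNN": KNeighborsClassifier(), "NB": GaussianNB(),
    "Tree": DecisionTreeClassifier(), "RF": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in models.items():
    # balanced accuracy == macro-averaged recall == UAR
    uar = balanced_accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: UAR={uar:.3f}")
```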
arXiv Detail & Related papers (2024-10-14T14:17:52Z)
- Attention vs LSTM: Improving Word-level BISINDO Recognition [0.0]
Indonesia ranks fourth globally in the number of deaf cases. Individuals with hearing impairments often find communication challenging, necessitating the use of sign language. This study aims to explore the application of AI in developing models for a simplified sign language translation app and dictionary.
arXiv Detail & Related papers (2024-09-03T15:17:39Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
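A sketch of the step such pipelines share: extracting frame-level features from a pre-trained SSL model with torchaudio (HuBERT here, which downloads weights on first use). The paper's domain fine-tuning and the TDNN/Conformer integration are not shown.

```python
import torch
import torchaudio

# Load a pre-trained SSL bundle and put the model in eval mode.
bundle = torchaudio.pipelines.HUBERT_BASE
ssl_model = bundle.get_model().eval()

# One second of placeholder audio at the bundle's expected sample rate.
waveform = torch.randn(1, bundle.sample_rate)
with torch.inference_mode():
    features, _ = ssl_model.extract_features(waveform)

frames = features[-1]   # last-layer features: (1, num_frames, 768)
# These frame-level features would feed a downstream acoustic model.
```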
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
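A sketch of the generic self-training loop behind unsupervised adaptation of this kind; `transcribe_with_confidence` and `finetune` are hypothetical placeholders, and STAR's actual token-level quality indicators are more refined than a single confidence threshold.

```python
def self_taught_adaptation(model, unlabeled_audio, threshold=0.9, rounds=3):
    """Transcribe unlabeled target-domain audio, keep only
    high-confidence pseudo-labels, and fine-tune on them."""
    for _ in range(rounds):
        pseudo = []
        for utt in unlabeled_audio:
            # Hypothetical API: returns (hypothesis text, confidence score).
            text, confidence = model.transcribe_with_confidence(utt)
            if confidence >= threshold:
                pseudo.append((utt, text))
        model.finetune(pseudo)   # hypothetical fine-tuning call
    return model
```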
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
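A toy stand-in for attention-free sequence mixing: a learned per-channel exponential-moving-average recurrence scanned in both directions. This conveys only the bidirectional state-space idea, not the BiGS layer itself.

```python
import torch
import torch.nn as nn

class BidirectionalSSMSketch(nn.Module):
    """Toy bidirectional mixing layer: a per-channel EMA recurrence
    s_t = a*s_{t-1} + (1-a)*x_t, run forward and backward in time."""

    def __init__(self, d_model=64):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(d_model))   # per-channel decay
        self.out = nn.Linear(2 * d_model, d_model)

    def scan(self, x, a):                    # x: (batch, seq, d_model)
        state = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.size(1)):           # sequential linear recurrence
            state = a * state + (1 - a) * x[:, t]
            ys.append(state)
        return torch.stack(ys, dim=1)

    def forward(self, x):
        a = torch.sigmoid(self.decay)        # keep decay in (0, 1)
        fwd = self.scan(x, a)
        bwd = self.scan(x.flip(1), a).flip(1)
        return self.out(torch.cat([fwd, bwd], dim=-1))

y = BidirectionalSSMSketch()(torch.randn(2, 16, 64))
```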
arXiv Detail & Related papers (2022-12-20T18:50:08Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with improved beam search, reaches quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
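For reference, a sketch of the RNN-Transducer objective this comparison centers on, using the `rnnt_loss` shipped with torchaudio; shapes are illustrative and no acoustic or prediction network is shown.

```python
import torch
import torchaudio.functional as F

# The joint network scores every (audio frame, output token) pair, and
# the transducer loss marginalizes over all monotonic alignments.
B, T, U, V = 2, 50, 10, 29   # batch, frames, target length, vocab (0 = blank)
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = F.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()
```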
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
- Improving auditory attention decoding performance of linear and non-linear methods using state-space model [21.40315235087551]
Recent advances in electroencephalography have shown that it is possible to identify the target speaker from single-trial EEG recordings.
Auditory attention decoding (AAD) methods reconstruct the attended speech envelope from EEG recordings, based on a linear least-squares cost function or non-linear neural networks.
We investigate a state-space model using correlation coefficients obtained with a small correlation window to improve the decoding performance.
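A sketch of the correlation-based AAD pipeline described above: a least-squares decoder reconstructs the attended envelope from EEG, and short-window correlations against each speaker's envelope are compared. All signals are random placeholders, and the paper's state-space smoothing of the correlation sequence is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 2000, 32                          # samples, EEG channels
eeg = rng.standard_normal((T, C))        # placeholder EEG
env_a = rng.standard_normal(T)           # speaker A envelope (attended)
env_b = rng.standard_normal(T)           # speaker B envelope

# Train a linear decoder on the attended speaker (plain least squares).
W, *_ = np.linalg.lstsq(eeg, env_a, rcond=None)
recon = eeg @ W                          # reconstructed envelope

win = 250                                # small correlation window
for start in range(0, T - win + 1, win):
    s = slice(start, start + win)
    r_a = np.corrcoef(recon[s], env_a[s])[0, 1]
    r_b = np.corrcoef(recon[s], env_b[s])[0, 1]
    # The paper smooths these noisy per-window decisions with a
    # state-space model; here we just take the raw argmax.
    print("attended:", "A" if r_a > r_b else "B")
```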
arXiv Detail & Related papers (2020-04-02T09:56:06Z)