SignSpeak: Open-Source Time Series Classification for ASL Translation
- URL: http://arxiv.org/abs/2407.12020v2
- Date: Thu, 18 Jul 2024 20:36:03 GMT
- Title: SignSpeak: Open-Source Time Series Classification for ASL Translation
- Authors: Aditya Makkar, Divya Makkar, Aarav Patel, Liam Hebert
- Abstract summary: We propose a low-cost, real-time ASL-to-speech translation glove and an exhaustive training dataset of sign language patterns.
We benchmarked this dataset with supervised learning models such as LSTMs, GRUs, and Transformers; our best model achieved 92% accuracy.
Our open-source dataset, models and glove designs provide an accurate and efficient ASL translator while maintaining cost-effectiveness.
- Score: 0.12499537119440243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of fluency in sign language remains a barrier to seamless communication for hearing- and speech-impaired communities. In this work, we propose a low-cost, real-time ASL-to-speech translation glove and an exhaustive training dataset of sign language patterns. We then benchmarked this dataset with supervised learning models such as LSTMs, GRUs, and Transformers; our best model achieved 92% accuracy. The SignSpeak dataset has 7200 samples encompassing 36 classes (A-Z, 1-10) and aims to capture realistic signing patterns by using five low-cost flex sensors to measure finger positions at each time step at 36 Hz. Our open-source dataset, models, and glove designs provide an accurate and efficient ASL translator while maintaining cost-effectiveness, establishing a framework for future work to build on.
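As a concrete illustration of the classification setup (sequences of five flex-sensor readings at 36 Hz mapped to one of 36 classes), here is a minimal PyTorch sketch of a GRU baseline. The layer sizes, sequence length, and head design are illustrative assumptions, not the architecture the authors report.

```python
# Minimal sketch: GRU classifier over glove flex-sensor time series.
# Hyperparameters below are assumptions, not SignSpeak's reported model.
import torch
import torch.nn as nn

class GloveGRUClassifier(nn.Module):
    def __init__(self, n_sensors=5, hidden=64, n_classes=36):
        super().__init__()
        self.gru = nn.GRU(n_sensors, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (batch, time_steps, n_sensors) flex-sensor readings
        _, h = self.gru(x)           # h: (num_layers, batch, hidden)
        return self.head(h[-1])      # logits over the 36 sign classes

model = GloveGRUClassifier()
batch = torch.randn(8, 72, 5)        # 8 random sequences, ~2 s at 36 Hz
print(model(batch).shape)            # torch.Size([8, 36])
```

The same input/output shapes apply to the LSTM and Transformer baselines the paper benchmarks; only the sequence encoder changes.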
Related papers
- The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge [6.481946043182915]
We introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge.
We use the ASLKG to train neuro-symbolic models for 3 ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of YouTube-ASL videos.
arXiv Detail & Related papers (2024-11-06T00:16:16Z)
- SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [18.701864254184308]
Self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS.
In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker.
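The retrieval step can be pictured with a small sketch: each source-frame SSL embedding is replaced by the average of its nearest neighbours in a target speaker's embedding pool. The shapes, dimensionality, and k below are assumptions, not SSL-TTS's actual configuration.

```python
# Minimal numpy sketch of kNN retrieval over SSL frame embeddings.
# Shapes and k are placeholders, not the paper's settings.
import numpy as np

def knn_retrieve(source, pool, k=4):
    # source: (T, D) frames to convert; pool: (N, D) target-speaker frames
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    pl = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = src @ pl.T                          # (T, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # k nearest pool frames per source frame
    return pool[topk].mean(axis=1)             # (T, D) retrieved embeddings

converted = knn_retrieve(np.random.randn(100, 768), np.random.randn(5000, 768))
print(converted.shape)  # (100, 768)
```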
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- BAUST Lipi: A BdSL Dataset with Deep Learning Based Bangla Sign Language Recognition [0.5497663232622964]
Sign language research is growing rapidly, with the aim of enhancing communication with the deaf community.
One significant barrier has been the lack of a comprehensive Bangla sign language dataset.
We introduce a new BdSL alphabet dataset totaling 18,000 images, each 224x224 pixels.
We devised a hybrid Convolutional Neural Network (CNN) model, integrating multiple convolutional layers, activation functions, dropout techniques, and LSTM layers.
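A minimal PyTorch sketch of such a hybrid follows, assuming the common pattern of feeding CNN feature-map rows to an LSTM; the layer counts, dropout rate, and class count are placeholders rather than the paper's reported model.

```python
# Minimal sketch: hybrid CNN+LSTM over 224x224 sign images.
# Architecture details here are assumptions, not the paper's model.
import torch
import torch.nn as nn

class HybridCNNLSTM(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(0.25),
        )
        self.lstm = nn.LSTM(64 * 56, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):
        f = self.conv(x)                        # (B, 64, 56, 56) feature maps
        seq = f.permute(0, 2, 1, 3).flatten(2)  # read rows as a 56-step sequence
        _, (h, _) = self.lstm(seq)
        return self.head(h[-1])

model = HybridCNNLSTM(n_classes=36)              # class count assumed, not from the paper
print(model(torch.randn(4, 3, 224, 224)).shape)  # torch.Size([4, 36])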
arXiv Detail & Related papers (2024-08-20T03:35:42Z) - EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event camera could naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z) - Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z) - BdSLW60: A Word-Level Bangla Sign Language Dataset [3.8631510994883254]
We create a comprehensive BdSL word-level dataset named BdSLW60 in an unconstrained and natural setting.
The dataset encompasses 60 Bangla sign words, with a significant scale of 9307 video trials provided by 18 signers under the supervision of a sign language professional.
We report the benchmarking of our BdSLW60 dataset using the Support Vector Machine (SVM) with testing accuracy up to 67.6% and an attention-based bi-LSTM with testing accuracy up to 75.1%.
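For the SVM side of such a benchmark, a minimal scikit-learn sketch looks like the following; the pooled 128-dimensional features and random placeholder data are assumptions standing in for the real video features.

```python
# Minimal sketch: SVM baseline over pooled per-trial features.
# Random placeholder data stands in for real video features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(9307, 128))    # one pooled feature vector per video trial (dim assumed)
y = rng.integers(0, 60, size=9307)  # 60 Bangla sign-word classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```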
arXiv Detail & Related papers (2024-02-13T18:02:58Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
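The composite idea can be sketched as bridging a pretrained speech encoder into a pretrained text decoder through a small adapter; the stand-in modules below are placeholders, not the actual pretrained models ComSL composes.

```python
# Minimal sketch: composite speech encoder -> adapter -> text decoder.
# The small modules stand in for large pretrained models.
import torch
import torch.nn as nn

speech_encoder = nn.GRU(80, 256, batch_first=True)  # stands in for a pretrained speech model
adapter = nn.Linear(256, 512)                        # maps speech features into the LM space
text_decoder = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

mel = torch.randn(2, 300, 80)                        # batch of mel-spectrogram frames
speech_feats, _ = speech_encoder(mel)
memory = adapter(speech_feats)                       # (2, 300, 512) cross-attention memory
tgt = torch.randn(2, 20, 512)                        # embedded target-text tokens
out = text_decoder(tgt, memory)                      # decoder attends to speech features
print(out.shape)  # torch.Size([2, 20, 512])
```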
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition [6.296362537531586]
Sign languages are used as a primary language by approximately 70 million D/deaf people worldwide.
To help tackle this problem, we release ASL Citizen, the first crowdsourced Isolated Sign Language Recognition dataset.
We propose that this dataset be used for sign language dictionary retrieval for American Sign Language (ASL), where a user demonstrates a sign to their webcam to retrieve matching signs from a dictionary.
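The retrieval use case can be sketched as nearest-neighbour search over sign embeddings; the encoder is abstracted away, and the dictionary size and embeddings below are arbitrary placeholders.

```python
# Minimal sketch: rank dictionary signs by cosine similarity to the
# embedding of a user's webcam sign. All data here is placeholder.
import numpy as np

def retrieve(query_emb, dictionary_embs, labels, top_n=5):
    q = query_emb / np.linalg.norm(query_emb)
    d = dictionary_embs / np.linalg.norm(dictionary_embs, axis=1, keepdims=True)
    ranked = np.argsort(-(d @ q))           # highest cosine similarity first
    return [labels[i] for i in ranked[:top_n]]

labels = [f"sign_{i}" for i in range(1000)]  # dictionary size arbitrary
dict_embs = np.random.randn(1000, 256)       # placeholder sign embeddings
query = np.random.randn(256)                 # embedding of the user's webcam sign
print(retrieve(query, dict_embs, labels))
```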
arXiv Detail & Related papers (2023-04-12T15:52:53Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
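A minimal PyTorch sketch of an LSTM language model over phoneme tokens follows; the vocabulary size and dimensions are illustrative, not the paper's configuration.

```python
# Minimal sketch: LSTM next-unit language model over phoneme ids.
# Vocabulary and layer sizes are assumptions.
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, vocab=50, emb=32, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                   # next-unit logits at every step

lm = UnitLM()
phonemes = torch.randint(0, 50, (4, 25))     # 4 sequences of 25 phoneme ids
logits = lm(phonemes)                        # (4, 25, 50)
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 50), phonemes[:, 1:].reshape(-1))
print(loss.item())
```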
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.