AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2507.19840v1
- Date: Sat, 26 Jul 2025 07:28:33 GMT
- Title: AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition
- Authors: Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, Assane Gueye
- Abstract summary: Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between hearing and hearing-impaired communities. We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which poses a challenge because it requires combining all of these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC- or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages and overfitting, and they struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. This decoder-only approach allows the model to map directly from features to glosses without CTC loss, while also learning the textual dependencies within the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, improving WER by up to 6.1% over the best existing method.
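To make the pipeline concrete, here is a minimal sketch of the pose-to-text idea the abstract describes: strided 1D convolutions compress the pose sequence in time, and the result conditions a causal LM as prefix embeddings. The published aubmindlab/aragpt2-base checkpoint stands in for the paper's AraGPT2 decoder; the keypoint layout, layer sizes, and compression factor are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of a pose-to-text decoder-only pipeline in the spirit of
# AutoSign; sizes and the pose layout are assumptions, not the paper's.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class TemporalCompressor(nn.Module):
    """Strided 1D convolutions that shorten the pose sequence in time."""
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, poses):               # poses: (B, T, in_dim)
        x = poses.transpose(1, 2)           # (B, in_dim, T) for Conv1d
        return self.net(x).transpose(1, 2)  # (B, T // 4, out_dim)

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")
lm = AutoModelForCausalLM.from_pretrained("aubmindlab/aragpt2-base")

# Assumed pose layout: 86 keypoints x (x, y, confidence) per frame.
compressor = TemporalCompressor(in_dim=86 * 3, hidden=512,
                                out_dim=lm.config.n_embd)

poses = torch.randn(1, 128, 86 * 3)             # one clip, 128 frames
prefix = compressor(poses)                      # (1, 32, n_embd) prefix
labels = tokenizer("target gloss text", return_tensors="pt").input_ids
tok_emb = lm.get_input_embeddings()(labels)     # (1, L, n_embd)

# Teacher forcing: the pose prefix conditions the causal LM, and the
# loss is applied only to the text positions (-100 is ignored).
inputs = torch.cat([prefix, tok_emb], dim=1)
ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long)
out = lm(inputs_embeds=inputs, labels=torch.cat([ignore, labels], dim=1))
print(out.loss)
```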
Related papers
- LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z)
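A hedged sketch of what semantic distillation for speech tokens can look like: a student speech encoder is projected into an LM's feature space and pulled toward frozen LM-aligned targets. The module shapes and the cosine objective are assumptions, not LM-SPT's exact design.

```python
# Toy semantic-distillation objective; all names and sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_encoder = nn.Sequential(          # student: mel frames -> features
    nn.Conv1d(80, 256, 3, padding=1), nn.GELU(),
    nn.Conv1d(256, 256, 3, padding=1),
)
to_lm_space = nn.Linear(256, 768)        # project into the LM feature space

mel = torch.randn(4, 80, 200)            # batch of mel spectrograms
lm_targets = torch.randn(4, 200, 768)    # frozen LM-aligned targets (stand-in)

student = to_lm_space(speech_encoder(mel).transpose(1, 2))
# Cosine distillation: align each frame with its semantic target.
distill_loss = 1 - F.cosine_similarity(student, lm_targets, dim=-1).mean()
```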
- StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation [33.695308849489784]
We propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs. Specifically, we train an encoder-decoder architecture to learn a structure-aware representation of the spatial-temporal skeleton. We design the Sign-GCN module as the key component in our framework, which effectively models spatial-temporal features.
arXiv Detail & Related papers (2025-06-16T07:09:51Z)
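As a rough illustration of a Sign-GCN-style spatial-temporal block, the sketch below mixes neighboring joints through a graph adjacency and then convolves over time. The adjacency matrix, joint count, and channel sizes are placeholder assumptions, not the paper's module.

```python
# Illustrative spatial-temporal graph block; adjacency and sizes are toys.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, in_c, out_c, adj):
        super().__init__()
        self.register_buffer("adj", adj)   # (J, J) normalized adjacency
        self.spatial = nn.Linear(in_c, out_c)
        self.temporal = nn.Conv2d(out_c, out_c, kernel_size=(9, 1),
                                  padding=(4, 0))
        self.act = nn.ReLU()

    def forward(self, x):                                # x: (B, T, J, C)
        x = torch.einsum("ij,btjc->btic", self.adj, x)   # mix joints
        x = self.act(self.spatial(x))                    # per-joint transform
        x = x.permute(0, 3, 1, 2)                        # (B, C, T, J)
        x = self.act(self.temporal(x))                   # convolve over time
        return x.permute(0, 2, 3, 1)                     # back to (B, T, J, C)

J = 27                                    # assumed joint count
adj = torch.eye(J)                        # identity as a placeholder graph
block = SpatialTemporalBlock(3, 64, adj)
print(block(torch.randn(2, 100, J, 3)).shape)   # (2, 100, 27, 64)
```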
- Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization [1.8024397171920885]
We train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
arXiv Detail & Related papers (2025-04-09T06:14:19Z)
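A minimal sketch of the articulator-based disentanglement idea: each articulator group gets its own encoder, and the concatenated latent is decoded back to the full pose vector. The hands/body/face channel split and all dimensions are assumptions.

```python
# Articulator-split pose autoencoder; the channel split is illustrative.
import torch
import torch.nn as nn

class ArticulatorAE(nn.Module):
    def __init__(self, dims, z=64):
        super().__init__()
        self.dims = dims                  # channels per articulator group
        self.enc = nn.ModuleDict({k: nn.Linear(d, z) for k, d in dims.items()})
        self.dec = nn.Linear(z * len(dims), sum(dims.values()))

    def forward(self, pose):              # pose: (B, T, sum(dims))
        parts, start = [], 0
        for k, d in self.dims.items():    # encode each articulator alone
            parts.append(self.enc[k](pose[..., start:start + d]))
            start += d
        z = torch.cat(parts, dim=-1)      # disentangled latent
        return self.dec(z), z

dims = {"hands": 84, "body": 24, "face": 60}    # assumed split (168 channels)
ae = ArticulatorAE(dims)
pose = torch.randn(2, 50, 168)
recon, latent = ae(pose)
loss = nn.functional.mse_loss(recon, pose)      # reconstruction objective
```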
- SignRep: Enhancing Self-Supervised Sign Representations [30.008980708977095]
Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. We introduce a scalable, self-supervised framework for sign representation learning. Our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models in downstream tasks. It excels in sign dictionary retrieval and sign translation, surpassing standard MAE pre-training and skeletal-based representations in retrieval.
arXiv Detail & Related papers (2025-03-11T15:20:01Z)
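The summary suggests keypoints supervise training but are not needed at inference. A hedged sketch of one way to realize that: mask per-frame video features and regress skeletal targets at the masked positions, so the keypoint estimator can be dropped after pretraining. Backbone, masking ratio, and keypoint layout are assumptions.

```python
# Masked pretraining with skeletal regression targets; details assumed.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
kp_head = nn.Linear(256, 54)                 # regress 27 (x, y) keypoints

feats = torch.randn(2, 100, 256)             # per-frame video features
mask = torch.rand(2, 100) < 0.6              # mask 60% of frames
masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)

pred_kp = kp_head(backbone(masked))          # predict skeleton everywhere
target_kp = torch.randn(2, 100, 54)          # keypoints from an estimator
loss = nn.functional.mse_loss(pred_kp[mask], target_kp[mask])
```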
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
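A toy sketch of conditional, step-by-step sequence generation in the spirit of the summary above: a denoiser iteratively refines a noisy pose sequence under a text or speech embedding. The update rule is a crude stand-in for a proper diffusion sampler, and all sizes are assumptions.

```python
# Conditional sequence refinement loop; not MS2SL's actual sampler.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts the noise in a pose sequence given condition and timestep."""
    def __init__(self, pose_dim=150, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x, cond, t):                 # x: (B, T, pose_dim)
        B, T, _ = x.shape
        tt = torch.full((B, T, 1), t)              # broadcast timestep
        c = cond.unsqueeze(1).expand(B, T, cond.shape[-1])
        return self.net(torch.cat([x, c, tt], dim=-1))

model = Denoiser()
cond = torch.randn(1, 512)                         # text/speech embedding
x = torch.randn(1, 60, 150)                        # start from pure noise
for step in reversed(range(50)):                   # step-by-step refinement
    x = x - 0.1 * model(x, cond, step / 50.0)      # crude denoising update
```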
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
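One plausible reading of "motion cues plus global semantics", sketched below: frame-difference magnitude selects which frames to mask for reconstruction, while a clip-level feature is aligned to a sentence embedding. Both choices are assumptions, not MASA's exact design.

```python
# Motion-aware masking plus a semantic alignment loss; details assumed.
import torch
import torch.nn.functional as F

feats = torch.randn(2, 100, 256)                       # per-frame features
motion = (feats[:, 1:] - feats[:, :-1]).norm(dim=-1)   # (B, 99) motion cue
idx = motion.topk(40, dim=1).indices                   # most dynamic frames
mask = torch.zeros(2, 99).scatter(1, idx, 1.0).bool()
# ...a masked autoencoder would reconstruct the frames selected by `mask`.

global_feat = feats.mean(dim=1)                        # (B, 256) clip summary
text_emb = torch.randn(2, 256)                         # frozen sentence embedding
align_loss = 1 - F.cosine_similarity(global_feat, text_emb, dim=-1).mean()
```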
- Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background.
We propose a multi-stream keypoint attention network to model a sequence of keypoints produced by a readily available keypoint estimator.
We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
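A hedged sketch of a generic multi-stream keypoint design: each keypoint group gets its own sequence encoder, and the streams are fused with attention. The stream split, encoders, and fusion scheme are assumptions rather than the paper's architecture.

```python
# Multi-stream keypoint encoding with attention fusion; layout assumed.
import torch
import torch.nn as nn

streams = {"body": 25, "left_hand": 21, "right_hand": 21}   # keypoints each
encoders = nn.ModuleDict(
    {k: nn.GRU(n * 2, 128, batch_first=True) for k, n in streams.items()}
)
fuse = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

clip = {k: torch.randn(2, 100, n * 2) for k, n in streams.items()}
outs = [encoders[k](clip[k])[0] for k in streams]     # (B, 100, 128) each
stacked = torch.stack(outs, dim=2).flatten(1, 2)      # (B, 300, 128)
fused, _ = fuse(stacked, stacked, stacked)            # cross-stream attention
```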
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
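A minimal sketch of how a classifier trained on isolated signs could be slid over continuous video to propose boundaries: windows where the classifier is uncertain are marked as candidate transitions. The classifier, window size, stride, and threshold here are all assumptions.

```python
# Sliding-window boundary proposal from an isolated-sign classifier.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(16 * 108, 500))  # 500 signs

stream = torch.randn(1, 1000, 108)          # continuous keypoint features
win, stride, boundaries = 16, 4, []
for s in range(0, stream.shape[1] - win, stride):
    logits = classifier(stream[:, s:s + win])
    conf = logits.softmax(-1).max().item()
    if conf < 0.2:                           # uncertain window: likely transition
        boundaries.append(s + win // 2)
```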
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT method based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
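Stage (i) combines CLIP-style contrastive pre-training with masked modeling. The sketch below shows just the standard symmetric contrastive objective on paired clip and sentence embeddings, with stand-in encoders; it is the textbook CLIP loss, not the paper's full pre-task.

```python
# Symmetric InfoNCE on paired visual/text embeddings, as in CLIP.
import torch
import torch.nn.functional as F

vis = F.normalize(torch.randn(8, 512), dim=-1)    # visual encoder outputs
txt = F.normalize(torch.randn(8, 512), dim=-1)    # text encoder outputs

logits = vis @ txt.t() / 0.07                     # scaled similarity matrix
targets = torch.arange(8)                         # matched pairs on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```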
- Self-Sufficient Framework for Continuous Sign Language Recognition [75.60327502570242]
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition.
It addresses key challenges, including the need for complex multi-scale features such as hands, face, and mouth for understanding, and the absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without the need for additional networks or annotations.
We also introduce DPLR, a pseudo-label refinement scheme that propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence.
arXiv Detail & Related papers (2023-03-21T11:42:57Z)
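A simplified reading of the pseudo-label refinement step, sketched below: frame-level predictions are kept only when they agree with the sentence-level gloss labels, and the remaining frames are filled from their neighbors to avoid spiky labels. This is an interpretation, not the paper's exact procedure.

```python
# Toy pseudo-label densification from sentence-level gloss labels.
import torch

frame_logits = torch.randn(100, 30)        # (T, vocab) frame predictions
gt_glosses = torch.tensor([3, 7, 12])      # sentence-level gloss labels

pred = frame_logits.argmax(-1)             # (T,) raw frame labels
valid = torch.isin(pred, gt_glosses)       # consistent with the sentence?
pseudo = pred.clone()
last = gt_glosses[0].item()
for t in range(len(pseudo)):               # fill invalid frames forward
    if valid[t]:
        last = pseudo[t].item()
    else:
        pseudo[t] = last                   # non-spiky, densified labels
```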
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
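At its core, character-to-character distillation pulls together features of corresponding characters from two augmented views. The sketch below shows that objective with idealized, pre-matched character features and an EMA teacher; the region matching that CCD actually performs is omitted.

```python
# Character-level self-distillation objective; matching is idealized.
import torch
import torch.nn.functional as F

student_chars = torch.randn(8, 12, 256)        # view 1: per-character features
with torch.no_grad():
    teacher_chars = torch.randn(8, 12, 256)    # view 2 via a frozen EMA teacher

loss = 1 - F.cosine_similarity(student_chars, teacher_chars, dim=-1).mean()
```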