Related papers: Spatio-temporal transformer to support automatic sign language translation

Spatio-temporal transformer to support automatic sign language translation

URL: http://arxiv.org/abs/2502.02587v1
Date: Tue, 04 Feb 2025 18:59:19 GMT
Title: Spatio-temporal transformer to support automatic sign language translation
Authors: Christian Ruiz, Fabio Martinez,
Abstract summary: This paper introduces a Transformer-based architecture that encodestemporal motion gestures, preserving both local and long-range spatial information.<n>The proposed approach was validated on the Colombian Sign Language Translation dataset.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sign Language Translation (SLT) systems support hearing-impaired people communication by finding equivalences between signed and spoken languages. This task is however challenging due to multiple sign variations, complexity in language and inherent richness of expressions. Computational approaches have evidenced capabilities to support SLT. Nonetheless, these approaches remain limited to cover gestures variability and support long sequence translations. This paper introduces a Transformer-based architecture that encodes spatio-temporal motion gestures, preserving both local and long-range spatial information through the use of multiple convolutional and attention mechanisms. The proposed approach was validated on the Colombian Sign Language Translation Dataset (CoL-SLTD) outperforming baseline approaches, and achieving a BLEU4 of 46.84%. Additionally, the proposed approach was validated on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T), achieving a BLEU4 score of 30.77%, demonstrating its robustness and effectiveness in handling real-world variations

Related papers

TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks [0.3749861135832072]
TSLFormer treats sign gestures as ordered, string-like language.<n>Method only works with 3D joint positions extracted from Google's Mediapipe library.<n>Results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.
arXiv Detail & Related papers (2025-05-11T14:30:56Z)
A multitask transformer to sign language translation using motion gesture primitives [0.6249768559720122]
This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach outperforms the state-of-the-art evaluated on the CoL-SLTD dataset, achieving a BLEU-4 of 72,64% in split 1, and a BLEU-4 of 14,64% in split 2.
arXiv Detail & Related papers (2025-03-25T13:53:25Z)
Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background. We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator. We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations. It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data. We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
Efficient Speech Translation with Dynamic Latent Perceivers [0.0]
Transformers have been the dominant architecture for Speech Translation, achieving significant improvements in translation quality. We propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. We also introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead.
arXiv Detail & Related papers (2022-10-28T16:52:48Z)
VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language. The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
Multi-channel Transformers for Multi-articulatory Sign Language Translation [59.38247587308604]
We tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself.
arXiv Detail & Related papers (2020-09-01T09:10:55Z)
Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences. Due to word order divergences in different languages, modeling the cross-lingual positional relationships might help SANs tackle this problem. We augment SANs with emphcross-lingual position representations to model the bilingually aware latent structure for the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.