POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation
- URL: http://arxiv.org/abs/2511.00270v1
- Date: Fri, 31 Oct 2025 21:44:59 GMT
- Title: POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation
- Authors: Abhinav Joshi, Vaibhav Sharma, Sanjeet Singh, Ashutosh Modi,
- Abstract summary: We propose POSESTITCH-SLT, a novel pre-training scheme for neural machine translation for sign languages.<n>We show that a simple transformer-based encoder-decoder architecture outperforms the prior art when considering template-generated sentence pairs in training.
- Score: 21.374625125480133
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior arts have focused on various feature extraction and architectural changes to support neural machine translation for sign languages. We propose POSESTITCH-SLT, a novel pre-training scheme that is inspired by linguistic-templates-based sentence generation technique. With translation comparison on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms the prior art when considering template-generated sentence pairs in training. We achieve BLEU-4 score improvements from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods for pose-based gloss-free translation. The results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.
Related papers
- Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA)<n>Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
arXiv Detail & Related papers (2025-12-08T21:05:46Z) - Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.<n>We incorporate additional contextual cues together with the signing video.<n>We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.<n>We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - Is context all you need? Scaling Neural Sign Language Translation to
Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z) - Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-)
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - Changing the Representation: Examining Language Representation for
Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
arXiv Detail & Related papers (2022-09-16T12:45:29Z) - Explore More Guidance: A Task-aware Instruction Network for Sign
Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Sign Language Transformers: Joint End-to-end Sign Language Recognition
and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.