MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation
- URL: http://arxiv.org/abs/2509.00030v2
- Date: Fri, 05 Sep 2025 15:41:49 GMT
- Title: MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation
- Authors: Marshall Thomas, Edward Fish, Richard Bowden
- Abstract summary: We introduce MultiStream-LLM, a modular framework designed to overcome limitations in automated sign language translation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and 73.2% letter accuracy on the ChicagoFSWildPlus fingerspelling dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
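The abstract describes a three-stage pipeline: expert decoders per modality, a fusion step that resolves temporal misalignment, then an LLM for sentence generation. The paper does not publish code, so the sketch below is only a structural illustration under stated assumptions: the `TimedToken` type, `fuse_streams`, and `translate` are hypothetical names, the fusion transformer is stood in by a simple time-ordered merge, and the LLM is a stub.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of the MultiStream-LLM pipeline shape; not the
# authors' implementation. Each expert is assumed to emit timestamped
# tokens for its modality.

@dataclass
class TimedToken:
    token: str
    t: float  # timestamp (seconds) of the frame the token was decoded from

def fuse_streams(streams: List[List[TimedToken]]) -> List[str]:
    """Stand-in for the lightweight fusion transformer: merge the parallel
    expert outputs into one sequence ordered by time, resolving the
    temporal misalignment between streams."""
    merged = sorted((tok for s in streams for tok in s), key=lambda x: x.t)
    return [tok.token for tok in merged]

def translate(signing: List[TimedToken],
              fingerspelling: List[TimedToken],
              lips: List[TimedToken],
              llm: Callable[[List[str]], str]) -> str:
    """Each expert decodes its modality into tokens; fusion orders them;
    the LLM turns the combined representation into a sentence."""
    fused = fuse_streams([signing, fingerspelling, lips])
    return llm(fused)

# Toy usage with stub expert outputs and a trivial "LLM".
signing = [TimedToken("MY", 0.1), TimedToken("NAME", 0.4)]
fingerspelling = [TimedToken("A", 0.6), TimedToken("N", 0.7), TimedToken("N", 0.8)]
lips = [TimedToken("<smile>", 0.5)]
sentence = translate(signing, fingerspelling, lips,
                     llm=lambda toks: " ".join(toks))
print(sentence)  # MY NAME <smile> A N N
```

The key design point the sketch preserves is that fingerspelled letters and non-manual cues are decoded by separate experts and only combined after each recognition task is solved in isolation.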
Related papers
- MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation [78.75809158246723]
We present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional context and supports efficient parallel multi-token generation. We also introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D-space objectives. MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%.
arXiv Detail & Related papers (2026-01-27T13:06:47Z) - POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation [21.374625125480133]
We propose POSESTITCH-SLT, a novel pre-training scheme for neural machine translation for sign languages. We show that a simple transformer-based encoder-decoder architecture outperforms the prior art when trained on template-generated sentence pairs.
arXiv Detail & Related papers (2025-10-31T21:44:59Z) - AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition [0.0]
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between hearing and hearing-impaired communities. We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
arXiv Detail & Related papers (2025-07-26T07:28:33Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z) - Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pretext tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - A Token-level Contrastive Framework for Sign Language Translation [9.185037439012952]
Sign Language Translation is a promising technology to bridge the communication gap between deaf and hearing people.
We propose ConSLT, a novel token-level Contrastive learning framework for Sign Language Translation.
arXiv Detail & Related papers (2022-04-11T07:33:26Z) - SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read all the videos before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z) - Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.