Related papers: LLMs are Good Sign Language Translators

LLMs are Good Sign Language Translators

URL: http://arxiv.org/abs/2404.00925v1
Date: Mon, 1 Apr 2024 05:07:13 GMT
Title: LLMs are Good Sign Language Translators
Authors: Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, Jun Liu,
Abstract summary: Sign Language Translation is a challenging task that aims to translate sign videos into spoken language. We propose a novel SignLLM framework to transform sign videos into a language-like representation. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
Score: 19.259163728870696
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.

Related papers

Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA)<n>Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
arXiv Detail & Related papers (2025-12-08T21:05:46Z)
Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text.<n>A common approach relies on gloss annotations as an intermediate representation.<n>We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z)
LLaVA-SLT: Visual Language Tuning for Sign Language Translation [42.20090162339927]
Recent advancements in Sign Language Translation (SLT) have shown promise, yet they often largely lag behind gloss-based approaches in terms of accuracy. We introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings. Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-12-21T08:01:08Z)
Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation [6.688680877428467]
We propose a novel gloss-free Multimodal Sign Language Translation framework. We generate detailed textual descriptions of sign language components using multimodal large language models. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily.
arXiv Detail & Related papers (2024-11-25T09:01:41Z)
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses. This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language. We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z)
T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language. Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method. We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
SignBLEU: Automatic Evaluation of Multi-channel Sign Language Translation [3.9711029428461653]
We introduce a new task named multi-channel sign language translation (MCSLT) We present a novel metric, SignBLEU, designed to capture multiple signal channels. We found that SignBLEU consistently correlates better with human judgment than competing metrics.
arXiv Detail & Related papers (2024-06-10T05:01:26Z)
SignLLM: Sign Language Production Large Language Models [31.557139567708067]
We propose SignLLM, a multilingual Sign Language Production (SLP) large language model. Two novel SLP modes MLSF and Prompt2LangGloss allow sign language gestures generation from query texts input and question-style prompts input respectively. We extensively evaluate SignLLM, demonstrating that our model achieves state-of-the-art performance on SLP tasks across eight sign languages.
arXiv Detail & Related papers (2024-05-17T12:01:43Z)
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation [30.008980708977095]
We introduce Sign2GPT, a novel framework for sign language translation. We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses. We evaluate our approach on two public benchmark sign language translation datasets.
arXiv Detail & Related papers (2024-05-07T10:00:38Z)
Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature. We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-) Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations. It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data. We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read all the videos before starting the translation. We propose SimulSLT, the first end-to-end simultaneous sign language translation model. SimulSLT achieves BLEU scores that exceed the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training. With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence. Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.