Related papers: Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation

Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation

URL: http://arxiv.org/abs/2405.04164v1
Date: Tue, 7 May 2024 10:00:38 GMT
Title: Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation
Authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden,
Abstract summary: We introduce Sign2GPT, a novel framework for sign language translation. We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses. We evaluate our approach on two public benchmark sign language translation datasets.
Score: 30.008980708977095
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Automatic Sign Language Translation requires the integration of both computer vision and natural language processing to effectively bridge the communication gap between sign and spoken languages. However, the deficiency in large-scale training data to support sign language translation means we need to leverage resources from spoken language. We introduce, Sign2GPT, a novel framework for sign language translation that utilizes large-scale pretrained vision and language models via lightweight adapters for gloss-free sign language translation. The lightweight adapters are crucial for sign language translation, due to the constraints imposed by limited dataset sizes and the computational requirements when training with long sign videos. We also propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses without requiring gloss order information or annotations. We evaluate our approach on two public benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T and CSL-Daily, and improve on state-of-the-art gloss-free translation performance with a significant margin.

Related papers

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text. We incorporate additional contextual cues together with the signing video. We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
LLaVA-SLT: Visual Language Tuning for Sign Language Translation [42.20090162339927]
Recent advancements in Sign Language Translation (SLT) have shown promise, yet they often largely lag behind gloss-based approaches in terms of accuracy. We introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings. Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-12-21T08:01:08Z)
SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale [22.49602248323602]
A persistent challenge in sign language video processing is how we learn representations of sign language. Our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training.
arXiv Detail & Related papers (2024-06-11T03:00:41Z)
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
LLMs are Good Sign Language Translators [19.259163728870696]
Sign Language Translation is a challenging task that aims to translate sign videos into spoken language. We propose a novel SignLLM framework to transform sign videos into a language-like representation. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
arXiv Detail & Related papers (2024-04-01T05:07:13Z)
Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z)
Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature. We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-) Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations. It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data. We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
Explore More Guidance: A Task-aware Instruction Network for Sign Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos. In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z)
A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts. Data is thus a bottleneck for training effective sign language translation models. This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.