Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining
- URL: http://arxiv.org/abs/2307.14768v1
- Date: Thu, 27 Jul 2023 10:59:18 GMT
- Title: Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining
- Authors: Benjia Zhou and Zhigang Chen and Albert Clapés and Jun Wan and
Yanyan Liang and Sergio Escalera and Zhen Lei and Du Zhang
- Abstract summary: Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT framework based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
- Score: 56.26550923909137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign Language Translation (SLT) is a challenging task due to its cross-domain
nature, involving the translation of visual-gestural language to text. Many
previous methods employ an intermediate representation, i.e., gloss sequences,
to facilitate SLT, thus transforming it into a two-stage task of sign language
recognition (SLR) followed by sign language translation (SLT). However, the
scarcity of gloss-annotated sign language data, combined with the information
bottleneck in the mid-level gloss representation, has hindered the further
development of the SLT task. To address this challenge, we propose a novel
Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves
SLT by inheriting language-oriented prior knowledge from pre-trained models,
without any gloss annotation assistance. Our approach involves two stages: (i)
integrating Contrastive Language-Image Pre-training (CLIP) with masked
self-supervised learning to create pre-tasks that bridge the semantic gap
between visual and textual representations and restore masked sentences, and
(ii) constructing an end-to-end architecture with an encoder-decoder-like
structure that inherits the parameters of the pre-trained Visual Encoder and
Text Decoder from the first stage. The seamless combination of these novel
designs forms a robust sign language representation and significantly improves
gloss-free sign language translation. In particular, we have achieved
unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset
(>+5) and the CSL-Daily dataset (>+3) compared to state-of-the-art gloss-free
SLT methods. Furthermore, our approach also achieves competitive results on the
PHOENIX14T dataset when compared with most of the gloss-based methods. Our code
is available at https://github.com/zhoubenjia/GFSLT-VLP.
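To make the two-stage recipe above concrete, here is a minimal PyTorch sketch of how a CLIP-style contrastive objective and masked sentence restoration could be combined in stage one, with the resulting visual encoder and text decoder reused for end-to-end translation in stage two. All module names, dimensions, masking choices, and loss weights are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
# Illustrative sketch of the two-stage GFSLT-VLP training recipe (PyTorch).
# Module names, dimensions, and loss weights are assumptions made for clarity;
# they do not reproduce the released code. Optimizer/backward steps are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignVideoEncoder(nn.Module):
    """Placeholder visual encoder: frame features -> sequence of visual tokens."""
    def __init__(self, feat_dim=1024, d_model=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame_feats):                  # (B, T, feat_dim)
        return self.encoder(self.proj(frame_feats))  # (B, T, d_model)


class TextDecoder(nn.Module):
    """Placeholder text decoder used to restore masked sentences in stage 1
    and to generate translations in stage 2."""
    def __init__(self, vocab_size=8000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):               # tokens: (B, L), memory: (B, T, d)
        h = self.decoder(self.embed(tokens), memory)
        return self.lm_head(h)                       # (B, L, vocab)


def clip_style_loss(vis_tokens, txt_tokens, temperature=0.07):
    """Symmetric contrastive loss between pooled video and text representations."""
    v = F.normalize(vis_tokens.mean(dim=1), dim=-1)  # (B, d)
    t = F.normalize(txt_tokens.mean(dim=1), dim=-1)  # (B, d)
    logits = v @ t.t() / temperature                 # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# ---- Stage 1: visual-language pretraining (contrastive alignment + masked restoration)
vis_enc, txt_dec = SignVideoEncoder(), TextDecoder()
txt_emb = nn.Embedding(8000, 512)                    # lightweight text-encoder stand-in
frame_feats = torch.randn(4, 64, 1024)               # dummy batch: 4 videos, 64 frames
sentences = torch.randint(0, 8000, (4, 20))          # target spoken-language sentences
masked = sentences.clone()
masked[:, ::4] = 1                                   # crude masking; token 1 plays [MASK]

vis_tok = vis_enc(frame_feats)
loss_align = clip_style_loss(vis_tok, txt_emb(sentences))
loss_restore = F.cross_entropy(
    txt_dec(masked, vis_tok).flatten(0, 1), sentences.flatten())
pretrain_loss = loss_align + loss_restore            # equal weighting is illustrative

# ---- Stage 2: end-to-end SLT, inheriting the pretrained encoder and decoder
translation_logits = txt_dec(sentences[:, :-1], vis_enc(frame_feats))
slt_loss = F.cross_entropy(
    translation_logits.flatten(0, 1), sentences[:, 1:].flatten())
```

The key point is that the same vis_enc and txt_dec objects carry their stage-one weights into stage two, which is what "inheriting the parameters of the pre-trained Visual Encoder and Text Decoder" refers to in the abstract.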
Related papers
- DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model [9.452839238264286]
We propose DiffSLT, a novel gloss-free sign language translation framework.
DiffSLT transforms random noise into the target latent representation conditioned on the visual features of input video.
We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap.
arXiv Detail & Related papers (2024-11-26T09:26:36Z) - C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval [37.12863427950066]
We introduce an innovative pretraining paradigm for gloss-free SLRL, called C$^2$RL.
C$^2$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-Daily, +6.2 on OpenASL, and +1.3 on How2Sign.
It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-Daily, and +5.9 on How2Sign.
arXiv Detail & Related papers (2024-08-19T12:42:10Z) - A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint space shared between signed language and spoken language text.
New annotations provide continuous sign-level labels for six hours of test videos and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - Changing the Representation: Examining Language Representation for
Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence-level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
arXiv Detail & Related papers (2022-09-16T12:45:29Z) - Explore More Guidance: A Task-aware Instruction Network for Sign
Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation methods typically first use a recognition module to generate glosses from sign language videos.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Improving Sign Language Translation with Monolingual Data by Sign
Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level (a sketch of this pipeline follows the list below).
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
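As a companion to the SignBT entry above, the following is a minimal sketch of a feature-level back-translation pipeline: monolingual spoken-language text is converted to a gloss sequence, and gloss-level feature clips are spliced into a pseudo sign sequence paired with the original text. The helpers text_to_gloss and sign_bank are hypothetical stand-ins, not the paper's actual interfaces.

```python
# Illustrative sketch of sign back-translation (SignBT) at the feature level.
# text_to_gloss and sign_bank are hypothetical stand-ins for a trained
# text-to-gloss model and an estimated gloss-to-sign feature bank.
from typing import Dict, List
import torch


def back_translate(text_to_gloss, sign_bank: Dict[str, List[torch.Tensor]],
                   monolingual_texts: List[str]):
    """Turn monolingual spoken-language text into pseudo (sign-feature, text) pairs."""
    pseudo_pairs = []
    for text in monolingual_texts:
        glosses = text_to_gloss(text)                 # e.g. ["MORGEN", "REGEN", ...]
        pieces = []
        for gloss in glosses:
            candidates = sign_bank.get(gloss)
            if not candidates:                        # skip glosses missing from the bank
                continue
            # Pick one stored feature clip for this gloss; a real system might
            # sample or choose by length/context instead.
            pieces.append(candidates[0])
        if pieces:
            # Splice gloss-level feature clips into one pseudo sign sequence.
            pseudo_sign = torch.cat(pieces, dim=0)    # (sum of clip lengths, feat_dim)
            pseudo_pairs.append((pseudo_sign, text))
    return pseudo_pairs


# Toy usage with a dummy text-to-gloss model and a two-entry feature bank.
dummy_bank = {"MORGEN": [torch.randn(12, 512)], "REGEN": [torch.randn(9, 512)]}
dummy_t2g = lambda text: ["MORGEN", "REGEN"]
pairs = back_translate(dummy_t2g, dummy_bank, ["morgen regnet es"])
print(len(pairs), pairs[0][0].shape)                  # 1 torch.Size([21, 512])
```

The spliced pseudo pairs can then be mixed with the original parallel data when training the translation model, which is how back-translation injects monolingual text into sign training.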