Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining
- URL: http://arxiv.org/abs/2307.14768v1
- Date: Thu, 27 Jul 2023 10:59:18 GMT
- Title: Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining
- Authors: Benjia Zhou and Zhigang Chen and Albert Clapés and Jun Wan and
Yanyan Liang and Sergio Escalera and Zhen Lei and Du Zhang
- Abstract summary: Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
- Score: 56.26550923909137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign Language Translation (SLT) is a challenging task due to its cross-domain
nature, involving the translation of visual-gestural language to text. Many
previous methods employ an intermediate representation, i.e., gloss sequences,
to facilitate SLT, thus transforming it into a two-stage task of sign language
recognition (SLR) followed by sign language translation (SLT). However, the
scarcity of gloss-annotated sign language data, combined with the information
bottleneck in the mid-level gloss representation, has hindered the further
development of the SLT task. To address this challenge, we propose a novel
Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves
SLT by inheriting language-oriented prior knowledge from pre-trained models,
without any gloss annotation assistance. Our approach involves two stages: (i)
integrating Contrastive Language-Image Pre-training (CLIP) with masked
self-supervised learning to create pre-tasks that bridge the semantic gap
between visual and textual representations and restore masked sentences, and
(ii) constructing an end-to-end architecture with an encoder-decoder-like
structure that inherits the parameters of the pre-trained Visual Encoder and
Text Decoder from the first stage. The seamless combination of these novel
designs forms a robust sign language representation and significantly improves
gloss-free sign language translation. In particular, we have achieved
unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset
(>+5) and the CSL-Daily dataset (>+3) compared to state-of-the-art gloss-free
SLT methods. Furthermore, our approach also achieves competitive results on the
PHOENIX14T dataset when compared with most of the gloss-based methods. Our code
is available at https://github.com/zhoubenjia/GFSLT-VLP.
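
To make stage (i) concrete, below is a minimal PyTorch sketch of the two pre-tasks described in the abstract: a CLIP-style contrastive objective that aligns pooled features from the visual and text branches, combined with a masked-sentence restoration loss. The module, tensor shapes, temperature, and masking scheme are illustrative assumptions rather than the authors' implementation; the official repository linked above is the authoritative reference.

```python
# Illustrative stage-1 objective: CLIP-style contrastive alignment + masked sentence
# restoration. Shapes, temperature, and masking are assumptions, not the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLPretrainingLoss(nn.Module):
    def __init__(self, temperature: float = 0.07):
        super().__init__()
        # Learnable temperature, stored in log space as in CLIP.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def contrastive(self, video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE over a batch of paired (sign video, sentence) embeddings.
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)   # matching pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def masked_restoration(self, decoder_logits, target_ids, mask):
        # Cross-entropy only on the token positions that were masked in the input sentence.
        per_token = F.cross_entropy(decoder_logits.transpose(1, 2), target_ids, reduction="none")
        return (per_token * mask).sum() / mask.sum().clamp(min=1)

    def forward(self, video_emb, text_emb, decoder_logits, target_ids, mask):
        return self.contrastive(video_emb, text_emb) + self.masked_restoration(
            decoder_logits, target_ids, mask
        )


# Toy usage: 4 video/sentence pairs, 10-token sentences, a 1000-word vocabulary, 512-d features.
if __name__ == "__main__":
    B, L, V, D = 4, 10, 1000, 512
    loss = VLPretrainingLoss()(
        torch.randn(B, D),                 # pooled features from the visual branch (assumed)
        torch.randn(B, D),                 # pooled features from the text branch (assumed)
        torch.randn(B, L, V),              # decoder logits for restoring the masked sentence
        torch.randint(0, V, (B, L)),       # ground-truth token ids
        (torch.rand(B, L) > 0.7).float(),  # 1 where a token was masked
    )
    print(loss.item())
```

The intuition, following the abstract, is that the contrastive term pulls paired video and sentence representations together to close the semantic gap, while the restoration term keeps the text side generative, which is what the stage-(ii) encoder-decoder later inherits.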
Related papers
- C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval [37.12863427950066]
We introduce an innovative pretraining paradigm for gloss-free SLRL, called C$^2$RL.
C$^2$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign.
It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign.
arXiv Detail & Related papers (2024-08-19T12:42:10Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Changing the Representation: Examining Language Representation for Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
arXiv Detail & Related papers (2022-09-16T12:45:29Z)
- Explore More Guidance: A Task-aware Instruction Network for Sign Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation systems first use a recognition module to generate glosses from sign language videos.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level (a toy sketch of this pipeline follows after this list).
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
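
As a complement to the SignBT entry above, here is a toy sketch of the back-translation pipeline it summarizes: monolingual text is translated to a gloss sequence, and a pseudo sign-feature sequence is spliced together from a gloss-to-feature bank. The `text_to_gloss` callable and the bank contents are hypothetical stand-ins, not the authors' implementation.

```python
# Toy sketch of sign back-translation: text -> glosses -> spliced pseudo sign features.
# The text-to-gloss model and the gloss bank are hypothetical; see the SignBT paper.
from typing import Callable, Dict, List

import torch


def sign_back_translate(
    sentence: str,
    text_to_gloss: Callable[[str], List[str]],   # assumed text-to-gloss translation model
    gloss_bank: Dict[str, torch.Tensor],          # gloss -> (T_g, D) clip of sign features
) -> torch.Tensor:
    """Build a pseudo sign-feature sequence paired with `sentence`."""
    glosses = text_to_gloss(sentence)
    # Splice the stored feature clips of the predicted glosses along the time axis.
    pieces = [gloss_bank[g] for g in glosses if g in gloss_bank]
    if not pieces:
        return torch.empty(0, 0)
    return torch.cat(pieces, dim=0)               # (sum of T_g, D) pseudo sign sequence


# Toy usage with a dummy gloss bank and a trivial "model" that uppercases words.
if __name__ == "__main__":
    bank = {"HEUTE": torch.randn(8, 512), "REGEN": torch.randn(12, 512)}
    features = sign_back_translate("heute regen", lambda s: s.upper().split(), bank)
    print(features.shape)  # torch.Size([20, 512])
```

The resulting (pseudo sign features, text) pairs can then be mixed with real parallel data to enlarge SLT training, which is the stated purpose of SignBT.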