Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
- URL: http://arxiv.org/abs/2003.13830v1
- Date: Mon, 30 Mar 2020 21:35:09 GMT
- Title: Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
- Authors: Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden
- Abstract summary: We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
- Score: 59.38247587308604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state of the art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains.
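Concretely, the joint objective amounts to a weighted sum of a CTC loss on a gloss-recognition head and a cross-entropy loss on the translation decoder. The PyTorch sketch below illustrates this coupling; the module names, tensor shapes, and loss weights are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class JointRecognitionTranslationLoss(nn.Module):
    """Sketch of a joint CSLR + translation objective.

    Assumes a shared encoder has produced frame-level features, a linear
    recognition head outputs per-frame gloss logits (trained with CTC, so
    no frame-level gloss timing is required), and a transformer decoder
    outputs spoken-language token logits (trained with cross-entropy).
    """

    def __init__(self, rec_weight=1.0, trans_weight=1.0,
                 gloss_blank_id=0, pad_id=0):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=gloss_blank_id, zero_infinity=True)
        self.xent = nn.CrossEntropyLoss(ignore_index=pad_id)
        self.rec_weight = rec_weight
        self.trans_weight = trans_weight

    def forward(self, gloss_logits, gloss_targets, frame_lens, gloss_lens,
                word_logits, word_targets):
        # CTC expects (T, B, C) log-probabilities over the gloss vocabulary.
        log_probs = gloss_logits.log_softmax(-1).permute(1, 0, 2)
        rec_loss = self.ctc(log_probs, gloss_targets, frame_lens, gloss_lens)
        # Standard token-level cross-entropy for the translation decoder.
        trans_loss = self.xent(
            word_logits.reshape(-1, word_logits.size(-1)),
            word_targets.reshape(-1))
        return self.rec_weight * rec_loss + self.trans_weight * trans_loss
```

Because CTC marginalizes over all monotonic alignments between frames and glosses, the recognition head can supervise the shared encoder without any frame-level timing annotations, which is what makes single-network end-to-end training possible.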
We evaluate the recognition and translation performances of our approaches on
the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report
state-of-the-art sign language recognition and translation results achieved by
our Sign Language Transformers. Our translation networks outperform both sign
video to spoken language and gloss to spoken language translation models, in
some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We
also share new baseline translation results using transformer networks for
several other text-to-text sign language translation tasks.
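For orientation, the BLEU-4 figures quoted above are corpus-level 4-gram BLEU scores. A minimal example of computing such a score with the sacrebleu library follows; the sentences are placeholders, not data from the paper.

```python
import sacrebleu

# Placeholder outputs and references; in practice these would be decoded
# spoken-language sentences and the PHOENIX14T ground truth.
hypotheses = ["am tag regnet es im norden"]
references = [["am tag regnet es vor allem im norden"]]

# sacrebleu's default corpus BLEU uses up-to-4-gram precision, i.e. the
# BLEU-4 reported above.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU-4: {bleu.score:.2f}")
```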
Related papers
- From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey, Taxonomy and Performance Evaluation [0.0]
There is a pressing need for an efficient, signs-driven, integrated end-to-end translation system.
We present the requirements of a real-time Quality-of-Service sign language machine translation system underpinned by accurate deep learning algorithms.
arXiv Detail & Related papers (2024-08-27T07:11:45Z)
- Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation [30.008980708977095]
We introduce Sign2GPT, a novel framework for sign language translation.
We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses.
We evaluate our approach on two public benchmark sign language translation datasets.
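As a rough illustration of such a pretraining strategy, the hypothetical sketch below supervises a sign encoder with automatically extracted pseudo-gloss sequences, using a CTC loss as a stand-in weak-alignment objective; the vocabulary size, dimensions, and loss choice are assumptions, not details from the paper.

```python
import torch.nn as nn

# Hypothetical pretraining step: supervise a sign encoder with automatically
# extracted pseudo-gloss sequences instead of manual gloss annotations.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6)
pseudo_gloss_head = nn.Linear(512, 1000)  # assumed pseudo-gloss vocabulary
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def pretrain_step(frame_feats, frame_lens, pseudo_glosses, pseudo_lens):
    # frame_feats: (B, T, 512) visual features from a (frozen) backbone.
    hidden = encoder(frame_feats)
    log_probs = pseudo_gloss_head(hidden).log_softmax(-1).permute(1, 0, 2)
    # Weakly align frames with the pseudo-gloss sequence via CTC.
    return ctc(log_probs, pseudo_glosses, frame_lens, pseudo_lens)
```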
arXiv Detail & Related papers (2024-05-07T10:00:38Z)
- Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
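One plausible reading of such a context-aware design is to let the decoder attend jointly to the current sign video and to features of the preceding discourse. The sketch below is an assumption-laden illustration of that idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical context-aware decoder: fuse sign-video features with an
# encoding of the preceding discourse before cross-attending from the decoder.
class ContextAwareSLT(nn.Module):
    def __init__(self, d_model=512, vocab_size=8000):
        super().__init__()
        self.context_proj = nn.Linear(d_model, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, context_feats, target_embeds):
        # Concatenating along time lets the decoder attend to both the
        # current signing and the surrounding discourse context.
        memory = torch.cat(
            [video_feats, self.context_proj(context_feats)], dim=1)
        return self.out(self.decoder(target_embeds, memory))
```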
arXiv Detail & Related papers (2023-08-18T15:27:22Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
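The contrastive half of stage (i) is in the spirit of CLIP's symmetric InfoNCE objective; a minimal sketch follows, with the temperature and shapes as assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

# Hypothetical CLIP-style objective: pull matching (sign video, sentence)
# embedding pairs together and push mismatched pairs apart.
def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=-1)  # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)    # (B, D)
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over video-to-text and text-to-video matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```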
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
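A simplified way to picture such cross-modality transfer is to let an existing gloss-to-text model mint additional spoken-language targets for the same sign videos. The loop below is a hypothetical sketch (the `gloss_to_text_model.translate` call is an assumed interface), not the XmDA algorithm itself.

```python
# Hypothetical augmentation loop: use an existing gloss-to-text model as a
# teacher to mint extra spoken-language targets for the same sign videos.
def augment_with_gloss_teacher(sign_dataset, gloss_to_text_model):
    augmented = []
    for video_feats, gloss_seq, text in sign_dataset:
        augmented.append((video_feats, text))         # original pair
        pseudo_text = gloss_to_text_model.translate(gloss_seq)  # assumed API
        augmented.append((video_feats, pseudo_text))  # teacher-minted pair
    return augmented
```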
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
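One common way to encode syntactic dependencies directly in a transformer, sketched hypothetically below, is to add a learned bias to attention scores along dependency edges; the paper's actual graph-aware mechanism may differ.

```python
import torch
import torch.nn as nn

# Hypothetical graph-aware attention: add a learned bias to the attention
# scores of token pairs linked by a syntactic dependency edge.
class DependencyBiasedAttention(nn.Module):
    def __init__(self, d_model=768, nhead=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.edge_bias = nn.Parameter(torch.zeros(1))

    def forward(self, x, dep_adjacency):
        # dep_adjacency: (B, L, L) binary matrix of dependency edges.
        bias = self.edge_bias * dep_adjacency
        # Expand per attention head; the additive float mask is applied
        # to the attention scores before softmax.
        mask = bias.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out
```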
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Explore More Guidance: A Task-aware Instruction Network for Sign Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation pipelines first use a recognition module to generate glosses from sign language videos.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z)
- SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read all the videos before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed the latest end-to-end non-simultaneous sign language translation model.
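Simultaneous translation of this kind is often organized around a wait-k style read/write policy. The sketch below assumes a hypothetical `model.predict_next` interface and illustrates the general policy, not SimulSLT's actual decoder.

```python
# Hypothetical wait-k style policy: after an initial lag of k video segments,
# emit one token per newly read segment; `model.predict_next` is an assumed
# interface, not SimulSLT's actual API.
def wait_k_decode(model, video_segments, k=3, max_len=60):
    read, tokens = [], [model.bos_id]
    for segment in video_segments:
        read.append(segment)          # READ: consume the next video segment
        if len(read) >= k:
            tokens.append(model.predict_next(read, tokens))  # WRITE a token
    # Source exhausted: finish the sentence from the full video.
    while tokens[-1] != model.eos_id and len(tokens) < max_len:
        tokens.append(model.predict_next(read, tokens))
    return tokens[1:]
```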
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
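A minimal sketch of such a plug-in cross-attention block is given below: each encoder layer attends first to its own language and then to the paired sentence in the other language. Layer sizes and normalization placement are assumptions, not VECO's exact design.

```python
import torch.nn as nn

# Hypothetical encoder layer with a plug-in cross-attention block, so that
# masked tokens in one language can also attend to the paired sentence in
# the other language instead of only their own context.
class CrossAttentiveEncoderLayer(nn.Module):
    def __init__(self, d_model=768, nhead=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, other_lang):
        x = self.norms[0](x + self.self_attn(x, x, x)[0])
        x = self.norms[1](x + self.cross_attn(x, other_lang, other_lang)[0])
        return self.norms[2](x + self.ffn(x))
```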
arXiv Detail & Related papers (2020-10-30T03:41:38Z)