Is context all you need? Scaling Neural Sign Language Translation to
Large Domains of Discourse
- URL: http://arxiv.org/abs/2308.09622v1
- Date: Fri, 18 Aug 2023 15:27:22 GMT
- Title: Is context all you need? Scaling Neural Sign Language Translation to
Large Domains of Discourse
- Authors: Ozge Mercanoglu Sincan, Necati Cihan Camgoz, Richard Bowden
- Abstract summary: Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
- Score: 34.70927441846784
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sign Language Translation (SLT) is a challenging task that aims to generate
spoken language sentences from sign language videos, both of which have
different grammar and word/gloss order. From a Neural Machine Translation (NMT)
perspective, the straightforward way of training translation models is to use
sign language phrase-spoken language sentence pairs. However, human
interpreters heavily rely on the context to understand the conveyed
information, especially for sign language interpretation, where the vocabulary
size may be significantly smaller than its spoken language equivalent.
Taking direct inspiration from how humans translate, we propose a novel
multi-modal transformer architecture that tackles the translation task in a
context-aware manner, as a human would. We use the context from previous
sequences and confident predictions to disambiguate weaker visual cues. To
achieve this we use complementary transformer encoders, namely: (1) A Video
Encoder, that captures the low-level video features at the frame-level, (2) A
Spotting Encoder, that models the recognized sign glosses in the video, and (3)
A Context Encoder, which captures the context of the preceding sign sequences.
We combine the information coming from these encoders in a final transformer
decoder to generate spoken language translations.
We evaluate our approach on the recently published large-scale BOBSL dataset,
which contains ~1.2M sequences, and on the SRF dataset, which was part of the
WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art
translation performance using contextual information, nearly doubling the
reported BLEU-4 scores of baseline approaches.
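
To make the three-encoder design concrete, below is a minimal PyTorch sketch of a context-aware SLT model with complementary Video, Spotting, and Context encoders feeding a shared decoder. The feature dimensions, vocabulary sizes, concatenation-based fusion of the encoder memories, and the omission of positional encodings and of the gloss-spotting pipeline are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a context-aware multi-modal SLT model (assumed PyTorch setup).
# Dimensions, vocabularies, and concatenation-based fusion are illustrative only.
import torch
import torch.nn as nn


def make_encoder(d_model: int, nhead: int, num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers)


class ContextAwareSLT(nn.Module):
    def __init__(self, video_dim=1024, d_model=512, nhead=8, num_layers=2,
                 gloss_vocab=2000, text_vocab=8000):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)        # per-frame visual features
        self.gloss_embed = nn.Embedding(gloss_vocab, d_model)  # spotted sign glosses
        self.text_embed = nn.Embedding(text_vocab, d_model)    # context and target tokens

        self.video_encoder = make_encoder(d_model, nhead, num_layers)     # (1) frame-level video
        self.spotting_encoder = make_encoder(d_model, nhead, num_layers)  # (2) recognized glosses
        self.context_encoder = make_encoder(d_model, nhead, num_layers)   # (3) preceding context

        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.out = nn.Linear(d_model, text_vocab)

    def forward(self, video_feats, gloss_ids, context_ids, target_ids):
        # Encode each source separately, then let the decoder cross-attend to all
        # three memories by concatenating them along the time axis.
        v = self.video_encoder(self.video_proj(video_feats))
        s = self.spotting_encoder(self.gloss_embed(gloss_ids))
        c = self.context_encoder(self.text_embed(context_ids))
        memory = torch.cat([v, s, c], dim=1)

        tgt = self.text_embed(target_ids)
        t = tgt.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # logits over the spoken-language vocabulary


# Toy forward pass: one clip with 60 frames, 5 spotted glosses, 12 context tokens
# from previously translated sentences, and a 10-token target sentence.
model = ContextAwareSLT()
logits = model(torch.randn(1, 60, 1024),
               torch.randint(0, 2000, (1, 5)),
               torch.randint(0, 8000, (1, 12)),
               torch.randint(0, 8000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 8000])
```

In this arrangement, the decoder's cross-attention over the spotted glosses and the preceding-sentence context is what would allow confident predictions and prior context to compensate for weak visual evidence, which is the intuition the abstract describes.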
Related papers
- Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation [30.008980708977095]
We introduce Sign2GPT, a novel framework for sign language translation.
We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses.
We evaluate our approach on two public benchmark sign language translation datasets.
arXiv Detail & Related papers (2024-05-07T10:00:38Z) - Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pretext tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - Explore More Guidance: A Task-aware Instruction Network for Sign
Language Translation Enhanced with Data Augmentation [20.125265661134964]
Sign language recognition and translation systems first use a recognition module to generate glosses from sign language videos.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
arXiv Detail & Related papers (2022-04-12T17:09:44Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read the entire video before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Sign Language Transformers: Joint End-to-end Sign Language Recognition
and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.