Explore More Guidance: A Task-aware Instruction Network for Sign
Language Translation Enhanced with Data Augmentation
- URL: http://arxiv.org/abs/2204.05953v3
- Date: Thu, 25 May 2023 10:21:56 GMT
- Title: Explore More Guidance: A Task-aware Instruction Network for Sign
Language Translation Enhanced with Data Augmentation
- Authors: Yong Cao, Wei Li, Xianzhi Li, Min Chen, Guangyong Chen, Long Hu,
Zhengdao Li, Hwang Kai
- Abstract summary: Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos, then a translation module to turn the glosses into spoken sentences.
In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation.
- Score: 20.125265661134964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language recognition and translation systems first use a
recognition module to generate glosses from sign language videos, and then
employ a translation module to translate the glosses into spoken sentences.
Most existing works focus on the recognition step while paying less attention
to sign language translation. In this work, we propose a task-aware
instruction network, namely TIN-SLT, for sign language translation, which
introduces an instruction module and a learning-based feature-fusion strategy
into a Transformer network. In this way, the language ability of the
pre-trained model can be fully exploited to further boost translation
performance. Moreover, by exploring the representation spaces of sign language
glosses and the target spoken language, we propose a multi-level data
augmentation scheme to adjust the data distribution of the training set. We
conduct extensive experiments on two challenging benchmark datasets,
PHOENIX-2014-T and ASLG-PC12, on which our method outperforms the previous
best solutions by 1.65 and 1.42 BLEU-4 points, respectively. Our code is
published at https://github.com/yongcaoplus/TIN-SLT.
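The learning-based feature-fusion strategy is only named in the abstract. As a
hedged illustration, here is a minimal sketch of one plausible realization: a
learned gate that blends frozen pretrained-LM token features into the
Transformer's own embeddings. The class name, dimensions, and gating form are
assumptions for illustration, not TIN-SLT's published implementation.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Blend frozen pretrained-LM features with trainable task embeddings
    through a learned per-dimension gate (illustrative sketch only)."""

    def __init__(self, pretrained_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(pretrained_dim, model_dim)  # align feature sizes
        self.gate = nn.Linear(2 * model_dim, model_dim)   # gate sees both views

    def forward(self, task_emb: torch.Tensor, lm_emb: torch.Tensor) -> torch.Tensor:
        lm_emb = self.proj(lm_emb)
        g = torch.sigmoid(self.gate(torch.cat([task_emb, lm_emb], dim=-1)))
        return g * task_emb + (1.0 - g) * lm_emb          # per-dimension blend

# Usage: fuse gloss-token embeddings with features of the same tokens from a
# pretrained language model (dimensions hypothetical).
fusion = GatedFeatureFusion(pretrained_dim=768, model_dim=512)
fused = fusion(torch.randn(8, 20, 512), torch.randn(8, 20, 768))  # (8, 20, 512)
```

A gate value near 1 keeps the task embedding and a value near 0 defers to the
pretrained features, so training decides per dimension how much of the
pretrained model's language ability to draw on.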
Related papers
- Diverse Sign Language Translation [27.457810402402387]
We introduce a Diverse Sign Language Translation (DivSLT) task, aiming to generate diverse yet accurate translations for sign language videos.
We employ large language models (LLMs) to generate multiple references for the widely-used CSL-Daily and PHOENIX14T SLT datasets.
Specifically, we investigate multi-reference training strategies to enable our DivSLT model to achieve diverse translations.
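A common multi-reference training strategy, sketched here as an assumption rather than DivSLT's exact objective, evaluates the model against every reference and keeps only the best-matching one per example:

```python
import torch
import torch.nn.functional as F

def min_over_references_loss(logits_per_ref, refs, pad_id=0):
    """Cross-entropy against each reference; keep the per-example minimum.

    logits_per_ref: list of (batch, len_r, vocab) tensors, one per reference
    refs:           list of (batch, len_r) token-id tensors
    """
    per_ref = []
    for logits, ref in zip(logits_per_ref, refs):
        ce = F.cross_entropy(logits.transpose(1, 2), ref,
                             ignore_index=pad_id, reduction="none")  # (batch, len)
        mask = (ref != pad_id).float()
        per_ref.append((ce * mask).sum(1) / mask.sum(1).clamp(min=1))
    # Each example is trained on whichever reference it already fits best.
    return torch.stack(per_ref, dim=1).min(dim=1).values.mean()
```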
arXiv Detail & Related papers (2024-10-25T14:28:20Z) - T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
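The dynamic-length mechanism is beyond a short sketch, but the building block it extends, nearest-neighbour vector quantization with straight-through gradients, can be outlined as follows (a generic VQ-VAE step, not the DVA-VAE model itself):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients
    (generic VQ-VAE core; DVA-VAE's dynamic encoding length is omitted)."""

    def __init__(self, num_codes: int, dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                   # z: (batch, time, dim)
        dists = torch.cdist(z, self.codebook.weight[None])  # (batch, time, codes)
        idx = dists.argmin(-1)                              # nearest code per step
        q = self.codebook(idx)
        # Codebook loss pulls codes toward the encoder; commitment loss does the reverse.
        loss = ((q - z.detach()) ** 2).mean() + self.beta * ((q.detach() - z) ** 2).mean()
        q = z + (q - z).detach()                            # straight-through estimator
        return q, idx, loss
```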
arXiv Detail & Related papers (2024-06-11T10:06:53Z) - A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint space shared between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
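Joint embedding spaces of this kind are commonly trained with a symmetric contrastive objective over paired batches; the following is a generic CLIP-style sketch, not necessarily CSLR2's exact recipe:

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(sign_emb, text_emb, temperature=0.07):
    """InfoNCE in both directions over a batch of paired sign/text embeddings."""
    sign_emb = F.normalize(sign_emb, dim=-1)          # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sign_emb @ text_emb.t() / temperature    # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Matched sign/text pairs sit on the diagonal of the similarity matrix, so each modality learns to retrieve its counterpart, which is what a shared embedding space requires.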
arXiv Detail & Related papers (2024-05-16T17:19:06Z) - Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation [30.008980708977095]
We introduce Sign2GPT, a novel framework for sign language translation.
We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses.
We evaluate our approach on two public benchmark sign language translation datasets.
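The pretraining objective is not spelled out in the summary above. One plausible form, offered purely as an assumption, is multi-label prediction of which pseudo-glosses occur in a clip, since automatically extracted pseudo-glosses are not aligned to individual frames:

```python
import torch
import torch.nn as nn

class PseudoGlossHead(nn.Module):
    """Predict the set of pseudo-glosses present in a clip from pooled
    frame features (hypothetical sketch, not Sign2GPT's published loss)."""

    def __init__(self, feat_dim: int, num_pseudo_glosses: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_pseudo_glosses)

    def forward(self, frame_feats, gloss_multi_hot):
        # frame_feats: (batch, frames, feat_dim); targets: (batch, num_glosses)
        pooled = frame_feats.mean(dim=1)           # clip-level representation
        logits = self.classifier(pooled)
        return nn.functional.binary_cross_entropy_with_logits(logits, gloss_multi_hot)
```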
arXiv Detail & Related papers (2024-05-07T10:00:38Z) - Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel gloss-free SLT method based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
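The stage (ii) parameter inheritance amounts to assembling the end-to-end model around the stage (i) modules and continuing training; a schematic sketch with placeholder modules and hypothetical shapes:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the stage-(i) pre-trained parts.
visual_encoder = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 512))
text_decoder = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

class GlossFreeSLT(nn.Module):
    """End-to-end encoder-decoder that reuses pre-trained parts (sketch)."""

    def __init__(self, encoder, decoder, vocab_size=3000, dim=512):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder  # inherited parameters
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, video_feats, target_tokens):
        memory = self.encoder(video_feats)                    # (batch, time, dim)
        h = self.decoder(self.embed(target_tokens), memory)   # cross-attend to video
        return self.out(h)                                    # token logits

model = GlossFreeSLT(visual_encoder, text_decoder)
logits = model(torch.randn(2, 16, 1024), torch.randint(0, 3000, (2, 10)))
```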
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
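One ingredient such a framework relies on can be sketched as pseudo-pair generation: run each sample's gloss annotation through a gloss-to-text teacher and pair the resulting text with the original video. The function below is a hedged sketch using Hugging Face-style generate/decode calls; XmDA itself combines cross-modality mix-up with knowledge distillation.

```python
import torch

@torch.no_grad()
def make_pseudo_pairs(dataset, gloss2text_model, tokenizer):
    """Turn gloss annotations into extra (video, text) training pairs via a
    gloss-to-text teacher (illustrative sketch of one XmDA-style ingredient)."""
    augmented = []
    for sample in dataset:  # each sample: {"video": ..., "gloss": "..."}
        gloss_ids = tokenizer.encode(sample["gloss"], return_tensors="pt")
        out = gloss2text_model.generate(gloss_ids, num_beams=5, max_new_tokens=64)
        pseudo_text = tokenizer.decode(out[0], skip_special_tokens=True)
        augmented.append({"video": sample["video"], "text": pseudo_text})
    return augmented
```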
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - Changing the Representation: Examining Language Representation for
Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence-level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
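Sentence-level embeddings of the kind described can be obtained, for example, by mean-pooling a pretrained BERT's final hidden states. This is a generic sketch using the Hugging Face transformers API; the German checkpoint and the pooling choice are assumptions, not necessarily the paper's setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")

@torch.no_grad()
def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(text, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (1, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # (1, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)    # (1, 768) sentence vector

emb = sentence_embedding("Guten Morgen")  # fed into the production pipeline
```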
arXiv Detail & Related papers (2022-09-16T12:45:29Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Sign Language Transformers: Joint End-to-end Sign Language Recognition
and Translation [59.38247587308604]
We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign-video-to-spoken-language and gloss-to-spoken-language translation models.
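Joint training of this kind pairs a CTC recognition loss over gloss sequences with token-level cross-entropy for translation; a hedged sketch with illustrative loss weights:

```python
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_slrt_loss(gloss_logits, gloss_targets, input_lens, target_lens,
                    text_logits, text_targets, w_rec=1.0, w_trans=1.0):
    """Recognition (CTC) + translation (cross-entropy) objective (sketch)."""
    # CTC expects (time, batch, classes) log-probabilities.
    log_probs = F.log_softmax(gloss_logits, dim=-1).transpose(0, 1)
    recognition = ctc(log_probs, gloss_targets, input_lens, target_lens)
    translation = F.cross_entropy(text_logits.transpose(1, 2), text_targets,
                                  ignore_index=0)
    return w_rec * recognition + w_trans * translation
```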
arXiv Detail & Related papers (2020-03-30T21:35:09Z)