VK-G2T: Vision and Context Knowledge enhanced Gloss2Text
- URL: http://arxiv.org/abs/2312.10210v1
- Date: Fri, 15 Dec 2023 21:09:34 GMT
- Title: VK-G2T: Vision and Context Knowledge enhanced Gloss2Text
- Authors: Liqiang Jing, Xuemeng Song, Xinxing Zu, Na Zheng, Zhongzhou Zhao,
Liqiang Nie
- Abstract summary: Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e. Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e. Gloss2Text).
We propose a vision and context knowledge enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of the sign language video to learn the properties of the target sentence and exploits context knowledge to facilitate the adaptive translation of gloss words.
- Score: 60.57628465740138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing sign language translation methods follow a two-stage pipeline: first
converting the sign language video to a gloss sequence (i.e. Sign2Gloss) and
then translating the generated gloss sequence into a spoken language sentence
(i.e. Gloss2Text). While previous studies have focused on boosting the
performance of the Sign2Gloss stage, we emphasize the optimization of the
Gloss2Text stage. However, this task is non-trivial due to two distinct
features of Gloss2Text: (1) isolated gloss input and (2) low-capacity gloss
vocabulary. To address these issues, we propose a vision and context knowledge
enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of
the sign language video to learn the properties of the target sentence and
exploits context knowledge to facilitate the adaptive translation of gloss
words. Extensive experiments conducted on a Chinese benchmark validate the
superiority of our model.
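To make the Gloss2Text setting concrete, here is a minimal sketch of a gloss-to-text transformer that also conditions on a pooled video feature, in the spirit of the visual-knowledge component described above. The module names, dimensions, and the simple prefix-token fusion are assumptions for illustration, not the authors' VK-G2T architecture.

```python
# Illustrative sketch only: prefix-token fusion of a pooled video feature with
# gloss embeddings before a vanilla encoder-decoder transformer. All names and
# sizes are assumptions, not the VK-G2T implementation.
import torch
import torch.nn as nn

class VisionAwareGloss2Text(nn.Module):
    def __init__(self, gloss_vocab=1000, text_vocab=3000, d_model=256, video_dim=1024):
        super().__init__()
        self.gloss_emb = nn.Embedding(gloss_vocab, d_model)
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)   # inject visual content
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, text_vocab)

    def forward(self, gloss_ids, video_feat, text_ids):
        # gloss_ids: (B, Lg) gloss token ids; video_feat: (B, video_dim) pooled
        # clip-level feature of the sign video; text_ids: (B, Lt) target tokens
        vis_token = self.video_proj(video_feat).unsqueeze(1)            # (B, 1, D)
        src = torch.cat([vis_token, self.gloss_emb(gloss_ids)], dim=1)  # prefix fusion
        tgt = self.text_emb(text_ids)
        tgt_mask = self.seq2seq.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.seq2seq(src, tgt, tgt_mask=tgt_mask)
        return self.lm_head(hidden)   # logits over the spoken-language vocabulary
```

Prefix fusion is only one plausible way to inject visual evidence; how VK-G2T actually models sentence properties and context knowledge is detailed in the paper.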
Related papers
- Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing [21.183453511034767]
We propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function.
Our approach surpasses state-of-the-art performance in Gloss2Text translation.
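One plausible reading of the semantically aware label smoothing mentioned above (a hedged sketch, not the paper's exact loss) spreads the smoothing mass according to embedding similarity to the gold token instead of uniformly over the vocabulary:

```python
# Sketch of similarity-weighted label smoothing; the weighting scheme is an
# assumption about the general idea, not the paper's published loss.
import torch
import torch.nn.functional as F

def semantic_label_smoothing_loss(logits, targets, token_emb, eps=0.1):
    # logits: (N, V) decoder outputs; targets: (N,) gold token ids;
    # token_emb: (V, D) output embedding table used to measure similarity.
    log_probs = F.log_softmax(logits, dim=-1)
    vocab = F.normalize(token_emb, dim=-1)                    # (V, D)
    gold = F.normalize(token_emb[targets], dim=-1)            # (N, D)
    soft = F.softmax(gold @ vocab.T, dim=-1)                  # (N, V) similarity mass
    one_hot = F.one_hot(targets, logits.size(-1)).float()     # (N, V)
    target_dist = (1 - eps) * one_hot + eps * soft
    return -(target_dist * log_probs).sum(dim=-1).mean()
```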
arXiv Detail & Related papers (2024-07-01T15:46:45Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
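As an illustration of the stage-(i) recipe, the following is a generic CLIP-style contrastive objective between pooled video and text features. The function name and temperature are assumptions; the paper's full pretext objective also includes masked sentence restoration.

```python
# Generic symmetric InfoNCE between paired video/text features, as commonly
# used in CLIP-style pretraining; not the paper's exact implementation.
import torch
import torch.nn.functional as F

def clip_style_contrastive(video_feat, text_feat, temperature=0.07):
    # video_feat, text_feat: (B, D) pooled representations of paired samples
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) pairwise similarities
    labels = torch.arange(v.size(0), device=v.device)
    # match each video to its own text and vice versa
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```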
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier.
We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally.
Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
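The local-attention intuition above can be illustrated with a generic banded attention mask, in which each video frame only attends to frames within a fixed window. This is an illustration of the general idea, not the paper's gloss-attention mechanism.

```python
# Banded mask for local attention; window size is an illustrative assumption.
import torch

def local_attention_mask(seq_len, window=8):
    # True marks positions the attention softmax must ignore.
    idx = torch.arange(seq_len)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return dist > window          # (seq_len, seq_len) boolean mask

# Usage: pass the mask as attn_mask to nn.MultiheadAttention so that
# out-of-window frame pairs receive -inf scores before the softmax.
```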
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
- Changing the Representation: Examining Language Representation for Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence-level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
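For reference, sentence-level embeddings from a pretrained language model can be obtained with mean pooling as sketched below; the model choice and pooling strategy are assumptions, not necessarily the paper's setup.

```python
# Mean-pooled BERT sentence embedding using the Hugging Face transformers API.
# Model and pooling are illustrative assumptions only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, L, 768)
    return hidden.mean(dim=1).squeeze(0)                # (768,) sentence vector

vec = sentence_embedding("the weather is cold tomorrow")
```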
arXiv Detail & Related papers (2022-09-16T12:45:29Z)
- Visual Keyword Spotting with Attention [82.79015266453533]
We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
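A very small sketch of such a two-stream model follows; the module names, widths, and the concatenation-based fusion are assumptions, not the paper's architecture.

```python
# Two-stream sketch: a video stream and a phonetic keyword stream are projected
# to a shared width, fused by a transformer encoder, and scored per frame.
# All components here are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamSpotter(nn.Module):
    def __init__(self, video_dim=512, phone_dim=64, d_model=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)
        self.phone_proj = nn.Linear(phone_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, video_feats, phone_feats):
        # video_feats: (B, T, video_dim); phone_feats: (B, K, phone_dim)
        x = torch.cat([self.video_proj(video_feats),
                       self.phone_proj(phone_feats)], dim=1)
        h = self.encoder(x)
        # per-frame localisation scores for the queried keyword
        return self.score(h[:, :video_feats.size(1)]).squeeze(-1)
```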
arXiv Detail & Related papers (2021-10-29T17:59:04Z)
- Data Augmentation for Sign Language Gloss Translation [115.13684506803529]
Sign language translation (SLT) is often decomposed into video-to-gloss recognition and gloss-to-text translation.
We focus here on gloss-to-text translation, which we treat as a low-resource neural machine translation (NMT) problem.
By pre-training on synthetic gloss-text data, we improve translation from American Sign Language (ASL) to English and from German Sign Language (DGS) to German by up to 3.14 and 2.20 BLEU, respectively.
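The low-resource framing above relies on synthetic gloss-text pairs. As a toy illustration only (the stop-word list and upper-casing rule are invented for this example, not the paper's heuristics), pseudo-glosses can be derived from spoken-language text like this:

```python
# Toy rule-based pseudo-gloss generator for augmentation; the rules below are
# illustrative assumptions, not the paper's heuristics.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def pseudo_gloss(sentence: str) -> str:
    tokens = [w for w in sentence.lower().split() if w not in STOPWORDS]
    return " ".join(t.upper() for t in tokens)

print(pseudo_gloss("The weather is cold and windy tomorrow"))
# -> "WEATHER COLD WINDY TOMORROW"; paired with the original sentence, this
#    forms one synthetic gloss-text training example.
```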
arXiv Detail & Related papers (2021-05-16T16:37:36Z)
- Better Sign Language Translation with STMC-Transformer [9.835743237370218]
Sign Language Translation first uses a Sign Language Recognition system to extract sign language glosses from videos.
A translation system then generates spoken language translations from the sign language glosses.
This paper introduces the STMC-Transformer, which improves on the current state of the art by over 5 and 7 BLEU, respectively.
arXiv Detail & Related papers (2020-04-01T17:20:04Z)