Sign Language Production with Latent Motion Transformer
- URL: http://arxiv.org/abs/2312.12917v1
- Date: Wed, 20 Dec 2023 10:53:06 GMT
- Title: Sign Language Production with Latent Motion Transformer
- Authors: Pan Xie, Taiyi Peng, Yao Du, Qipeng Zhang
- Abstract summary: We develop a new method to produce high-quality sign videos without using human poses as an intermediate step.
Our model works in two main stages: first, it jointly learns a generator and the video's latent features; next, a second model learns the ordering of these latent features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
- Score: 2.184775414778289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign Language Production (SLP) is the challenging task of turning
sign glosses into sign language videos. In this research, we develop a new
method to produce high-quality sign videos without using human poses as an
intermediate step. Our model works in two main stages: first, it jointly
learns a generator and the video's discrete latent features; next, a second
model learns the ordering of these latent features. To make this approach
better suited to sign videos, we introduce several significant improvements.
(i) In the first stage, we adopt an improved 3D VQ-GAN to learn downsampled
latent representations. (ii) In the second stage, we introduce
sequence-to-sequence attention to better leverage conditional information.
(iii) Because the two stages are trained separately, the latent codes in the
second stage lose the realistic visual semantics they carry in the first. To
endow the latent sequences with semantic information, we extend token-level
autoregressive latent-code learning with a perceptual loss and a
reconstruction loss, giving the prior model visual perception. Compared with
previous state-of-the-art approaches, our model performs consistently better
on two word-level sign language datasets, i.e., WLASL and NMFs-CSL.
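To make the two-stage design concrete, here is a minimal PyTorch-style sketch of the first stage: a 3D-convolutional encoder downsamples the video, and a VQ-GAN-style vector-quantization bottleneck maps each latent vector to its nearest codebook entry. All module names, layer sizes, and hyperparameters below are illustrative assumptions, not the paper's exact architecture.

```python
# Stage one (sketch): 3D encoder + vector quantization, as in VQ-GAN.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):                       # z: (B, T, H, W, D)
        flat = z.reshape(-1, z.shape[-1])       # (B*T*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)  # L2 to each code
        idx = dist.argmin(dim=1)                # nearest code per latent
        zq = self.codebook(idx).view_as(z)
        # codebook + commitment losses, then straight-through estimator
        loss = ((zq - z.detach()) ** 2).mean() \
             + self.beta * ((zq.detach() - z) ** 2).mean()
        zq = z + (zq - z).detach()
        return zq, idx.view(z.shape[:-1]), loss

class Encoder3D(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # two stride-2 3D convolutions downsample time and space by 4x
        self.net = nn.Sequential(
            nn.Conv3d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(128, dim, 4, stride=2, padding=1),
        )

    def forward(self, video):                   # video: (B, 3, T, H, W)
        z = self.net(video)                     # (B, D, T/4, H/4, W/4)
        return z.permute(0, 2, 3, 4, 1)         # channels last for VQ
```

In a full VQ-GAN setup, the quantized latents would feed a 3D decoder trained with the usual reconstruction, perceptual, and adversarial losses; the stage-two prior only needs the discrete code indices.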
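The second stage can then be sketched as an autoregressive transformer prior over the flattened latent tokens, with sequence-to-sequence (cross-) attention over the gloss condition. The perceptual and reconstruction terms the abstract describes are only stubbed in a comment; every name and size here is again an assumption, not the paper's exact setup.

```python
# Stage two (sketch): gloss-conditioned autoregressive prior over latent tokens.
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    def __init__(self, num_codes=1024, dim=512, gloss_vocab=2000):
        super().__init__()
        self.tok_emb = nn.Embedding(num_codes, dim)
        self.gloss_emb = nn.Embedding(gloss_vocab, dim)
        # positional embeddings omitted for brevity
        self.transformer = nn.Transformer(
            d_model=dim, nhead=8, num_encoder_layers=2,
            num_decoder_layers=6, batch_first=True)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, gloss_ids, token_ids):
        # teacher forcing: predict token t from tokens < t and the gloss
        mem = self.gloss_emb(gloss_ids)                  # (B, Lg, D)
        tgt = self.tok_emb(token_ids[:, :-1])            # shifted input
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(mem, tgt, tgt_mask=mask)  # causal decoding
        return self.head(out)                            # (B, L-1, K)

def prior_loss(model, gloss_ids, token_ids):
    logits = model(gloss_ids, token_ids)
    ce = nn.functional.cross_entropy(
        logits.transpose(1, 2), token_ids[:, 1:])  # token-level AR loss
    # The abstract additionally decodes predicted codes back to pixels and
    # applies perceptual + reconstruction losses; stubbed here:
    # recon = stage_one_decoder(logits.argmax(-1)); add L1/perceptual terms.
    return ce
```

At inference, tokens are sampled one at a time from the head's softmax and the stage-one decoder maps the sampled codes back to video frames.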
Related papers
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT method based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pretext tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier.
We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally.
Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
- SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep-learning-based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z)
- A Transformer-Based Contrastive Learning Approach for Few-Shot Sign Language Recognition [0.0]
We propose a novel Contrastive Transformer-based model that learns rich representations from body keypoint sequences.
Experiments showed that the model generalizes well and achieves competitive results for sign classes never seen during training.
arXiv Detail & Related papers (2022-04-05T11:42:55Z)
- SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [94.30084702921529]
Hand gestures play a critical role in sign language.
Current deep-learning-based sign language recognition methods may suffer from insufficient interpretability.
We introduce the first self-supervised pre-trainable SignBERT with incorporated hand prior for SLR.
arXiv Detail & Related papers (2021-10-11T16:18:09Z)
- Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks [37.679114155300084]
Sign Language Production (SLP) must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community.
We propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D sign pose sequences.
We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation to produce realistic and expressive sign pose sequences.
arXiv Detail & Related papers (2021-03-11T22:11:17Z)
- Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
arXiv Detail & Related papers (2020-11-19T14:31:06Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)