Signs as Tokens: An Autoregressive Multilingual Sign Language Generator
- URL: http://arxiv.org/abs/2411.17799v1
- Date: Tue, 26 Nov 2024 18:28:09 GMT
- Title: Signs as Tokens: An Autoregressive Multilingual Sign Language Generator
- Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
- Abstract summary: We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets.
- Score: 55.94334001112357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at https://2000zrl.github.io/soke/.
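The vocabulary-extension step described in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration (not the authors' released code) of how per-body-part sign tokens might be appended to a pretrained LM's text vocabulary; the `t5-base` checkpoint, the token naming scheme, and the codebook sizes are all illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Minimal sketch (not the authors' code): append discrete sign tokens,
# one sub-vocabulary per body part, to a pretrained LM's text vocabulary.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

codebook_sizes = {"body": 512, "hand": 512, "face": 256}  # assumed sizes
sign_tokens = [f"<{part}_{i}>"
               for part, size in codebook_sizes.items()
               for i in range(size)]

tokenizer.add_tokens(sign_tokens)              # extend the raw vocabulary
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix
# Supervised fine-tuning can then treat text -> sign-token sequences as a
# standard sequence-to-sequence task.
```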
Related papers
- Scaling up Multimodal Pre-training for Sign Language Understanding [96.17753464544604]
Sign language serves as the primary means of communication for the deaf-mute community.
To facilitate communication between deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied.
These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos.
arXiv Detail & Related papers (2024-08-16T06:04:25Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
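As a rough sketch of the supervised-semantic-token idea, the snippet below shows the kind of nearest-neighbour vector-quantization lookup that could be inserted into an encoder; the codebook size and feature shapes are assumptions, not CosyVoice's actual configuration.

```python
import torch

# Toy vector-quantization step: map continuous encoder states to the
# nearest codebook entry, yielding discrete semantic tokens.
def quantize(hidden, codebook):
    # hidden: (batch, time, dim); codebook: (num_codes, dim)
    dists = torch.cdist(hidden,
                        codebook.unsqueeze(0).expand(hidden.size(0), -1, -1))
    token_ids = dists.argmin(dim=-1)   # discrete semantic tokens
    quantized = codebook[token_ids]    # nearest codebook vectors
    return token_ids, quantized

codebook = torch.randn(4096, 256)      # assumed codebook size
hidden = torch.randn(2, 100, 256)      # assumed encoder output shape
tokens, quantized = quantize(hidden, codebook)  # tokens: (2, 100)
```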
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
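A schematic of one reverse step of such a sequence diffusion model, conditioned on a text or speech embedding, might look as follows; the noise predictor, schedule, and pose layout are placeholders rather than the paper's architecture.

```python
import torch

# One schematic denoising step: predict the noise, then estimate the
# clean pose sequence. A full sampler would re-noise to step t-1.
def reverse_step(eps_model, x_t, t, cond, alphas_cumprod):
    a_t = alphas_cumprod[t]
    eps = eps_model(x_t, t, cond)  # conditional noise prediction
    x0_hat = (x_t - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
    return x0_hat

# Toy usage with a stand-in noise predictor.
eps_model = lambda x, t, c: torch.zeros_like(x)
alphas_cumprod = torch.linspace(0.99, 0.01, 1000)
x_t = torch.randn(1, 64, 150)   # (batch, frames, pose dims) -- assumed
cond = torch.randn(1, 512)      # text/speech embedding -- assumed
x0 = reverse_step(eps_model, x_t, 999, cond, alphas_cumprod)
```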
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
- SignCLIP: Connecting Text and Sign Language by Contrastive Learning [39.72545568965546]
SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.
We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of 500 thousand video clips in up to 44 sign languages.
We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights.
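The pairing objective behind a CLIP-style model can be sketched as a symmetric contrastive loss between text and sign-pose embeddings; this is the generic formulation, not necessarily SignCLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

# CLIP-style symmetric contrastive loss over a batch of paired
# text / sign-pose embeddings.
def clip_loss(text_emb, sign_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    sign_emb = F.normalize(sign_emb, dim=-1)
    logits = text_emb @ sign_emb.t() / temperature
    labels = torch.arange(logits.size(0))  # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```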
arXiv Detail & Related papers (2024-07-01T13:17:35Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
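A toy rendering of the joint/motion idea: first-order motion is simply the temporal difference of joint coordinates, and a contrastive consistency term can align the two views. The shapes and loss form below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def first_order_motion(joints):
    # joints: (batch, frames, num_joints, coords)
    return joints[:, 1:] - joints[:, :-1]

def consistency_loss(z_joint, z_motion, temperature=0.1):
    # Pull embeddings of the joint and motion views of the same clip together.
    z_joint = F.normalize(z_joint, dim=-1)
    z_motion = F.normalize(z_motion, dim=-1)
    logits = z_joint @ z_motion.t() / temperature
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)

joints = torch.randn(4, 32, 27, 3)
motion = first_order_motion(joints)                 # (4, 31, 27, 3)
loss = consistency_loss(torch.randn(4, 128), torch.randn(4, 128))
```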
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
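The information-density intuition can be illustrated with a toy heuristic that keeps more frames where the signal changes quickly; the paper's DVA-VAE learns this adaptively, so the snippet is only a conceptual stand-in.

```python
import torch

# Toy density-adaptive selection: spend more of the token budget where the
# features change quickly, fewer where the sequence is nearly static.
def adaptive_keep_mask(features, budget_ratio=0.5):
    # features: (frames, dim); score frames by first-order change
    change = (features[1:] - features[:-1]).norm(dim=-1)
    change = torch.cat([change[:1], change])       # pad the first frame
    k = max(1, int(budget_ratio * features.size(0)))
    keep = change.topk(k).indices.sort().values    # most informative frames
    return features[keep], keep

feats = torch.randn(100, 64)
kept, idx = adaptive_keep_mask(feats, budget_ratio=0.3)  # 30 frames survive
```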
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale [22.49602248323602]
A persistent challenge in sign language video processing is how we learn representations of sign language.
Our proposed method focuses on just the most relevant parts of a signing video: the face, hands, and body posture of the signer.
Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training.
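The multi-stream idea might be sketched as cropping face and hand regions from single frames and encoding each stream separately; the crop coordinates below are made-up placeholders for keypoint-derived boxes.

```python
import torch

# Crop a square region around an articulator from one frame.
def crop(frame, center, size):
    # frame: (3, H, W); center: (x, y) in pixels
    x, y = int(center[0]), int(center[1])
    half = size // 2
    return frame[:, max(0, y - half): y + half, max(0, x - half): x + half]

frame = torch.rand(3, 256, 256)
streams = {
    "face": crop(frame, (128, 64), 64),
    "left_hand": crop(frame, (96, 160), 48),
    "right_hand": crop(frame, (160, 160), 48),
}
# Each stream feeds its own lightweight encoder; training on single frames
# rather than full videos is what keeps the pre-training cheap.
```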
arXiv Detail & Related papers (2024-06-11T03:00:41Z)
- SignBLEU: Automatic Evaluation of Multi-channel Sign Language Translation [3.9711029428461653]
We introduce a new task named multi-channel sign language translation (MCSLT).
We present a novel metric, SignBLEU, designed to capture multiple signal channels.
We found that SignBLEU consistently correlates better with human judgment than competing metrics.
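A toy version of channel-wise n-gram scoring in the spirit of SignBLEU is sketched below; the real metric also handles temporal alignment across channels and brevity, which this sketch omits.

```python
from collections import Counter

# Score each articulator channel separately by n-gram overlap,
# then combine the per-channel scores.
def ngram_precision(hyp, ref, n=2):
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    return overlap / max(1, sum(hyp_ngrams.values()))

channels = {  # (hypothesis, reference) per channel -- toy glosses
    "right_hand": (["IX", "GIVE", "BOOK"], ["IX", "BOOK", "GIVE", "BOOK"]),
    "face": (["raised_brows", "neutral"], ["raised_brows", "neutral"]),
}
scores = {ch: ngram_precision(h, r) for ch, (h, r) in channels.items()}
combined = (scores["right_hand"] * scores["face"]) ** 0.5  # geometric mean
```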
arXiv Detail & Related papers (2024-06-10T05:01:26Z)
- A Data-Driven Representation for Sign Language Production [26.520016084139964]
Sign Language Production aims to automatically translate spoken language sentences into continuous sequences of sign language.
Current state-of-the-art approaches depend on scarce linguistic resources.
This paper introduces an innovative solution by transforming the continuous pose generation problem into a discrete sequence generation problem.
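One simple way to picture the continuous-to-discrete transformation is clustering pose frames into a codebook and then modeling the index sequence; k-means here is only a stand-in for the paper's learned representation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Learn a codebook over pose frames, then represent each clip as a
# sequence of discrete codebook indices.
poses = np.random.rand(5000, 150)            # (frames, flattened pose dims)
codebook = KMeans(n_clusters=256, n_init="auto").fit(poses)

tokens = codebook.predict(poses[:100])       # discrete sequence for one clip
decoded = codebook.cluster_centers_[tokens]  # lossy pose reconstruction
```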
arXiv Detail & Related papers (2024-04-17T15:52:38Z)
- A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars [49.60328609426056]
Spoken2Sign is a system for translating spoken languages into sign languages.
We present a simple baseline consisting of three steps: creating a gloss-video dictionary, estimating a 3D sign for each sign video, and training a Spoken2Sign model.
As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs.
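The three-step baseline can be outlined as a pipeline skeleton; every function body below is a placeholder for the actual component described in the abstract.

```python
# Pipeline skeleton for the three steps named in the abstract;
# all bodies are placeholders, not the paper's implementation.
def build_gloss_video_dictionary(corpus):
    """Step 1: map each gloss to a representative sign video."""
    return {item["gloss"]: item["video"] for item in corpus}

def estimate_3d_sign(video):
    """Step 2: lift a sign video to a 3D avatar sequence (placeholder)."""
    return {"video": video, "pose_3d": None}

def train_spoken2sign(dictionary_3d, parallel_text):
    """Step 3: learn to map spoken-language text to 3D sign output."""
    ...

corpus = [{"gloss": "BOOK", "video": "book.mp4"}]  # toy dictionary entry
dictionary = build_gloss_video_dictionary(corpus)
dictionary_3d = {g: estimate_3d_sign(v) for g, v in dictionary.items()}
```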
arXiv Detail & Related papers (2024-01-09T18:59:49Z)
- Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
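The sign-to-sign mapping step might be pictured as nearest-neighbour search in the embedding space of an isolated sign language recognition model; the embeddings below are random placeholders.

```python
import torch
import torch.nn.functional as F

# Map each isolated sign in language A to its closest counterpart in
# language B using recognizer embeddings (random stand-ins here).
emb_lang_a = F.normalize(torch.randn(300, 512), dim=-1)  # dictionary A
emb_lang_b = F.normalize(torch.randn(450, 512), dim=-1)  # dictionary B

similarity = emb_lang_a @ emb_lang_b.t()   # (300, 450) cosine similarities
best_match = similarity.argmax(dim=-1)     # closest sign in B for each in A
# Mapped pairs can then augment the target-language training data.
```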
arXiv Detail & Related papers (2023-08-21T15:58:47Z)
- Learnt Contrastive Concept Embeddings for Sign Recognition [33.72708697077754]
We focus on explicitly creating sign embeddings that bridge the gap between sign language and spoken language.
We train a vocabulary of embeddings that are based on the linguistic labels for sign video.
We develop a conceptual similarity loss which is able to utilise word embeddings from NLP methods to create sign embeddings that have better sign language to spoken language correspondence.
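A hedged sketch of a conceptual-similarity objective: align the pairwise similarity structure of sign embeddings with that of spoken-language word embeddings. The word vectors are random stand-ins, and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

# Match the pairwise similarity matrix of the sign-embedding space to
# that of an NLP word-embedding space.
def conceptual_similarity_loss(sign_emb, word_emb):
    s = F.normalize(sign_emb, dim=-1)
    w = F.normalize(word_emb, dim=-1)
    return F.mse_loss(s @ s.t(), w @ w.t())

loss = conceptual_similarity_loss(torch.randn(16, 256), torch.randn(16, 300))
```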
arXiv Detail & Related papers (2023-08-18T12:47:18Z)
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
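Co-articulation can be pictured, very crudely, as blending the end of one dictionary sign into the start of the next; the paper learns this transition, so the linear interpolation below is only a stand-in.

```python
import torch

# Blend two dictionary signs over a short overlap window instead of
# concatenating them abruptly.
def coarticulate(sign_a, sign_b, overlap=8):
    # sign_a, sign_b: (frames, pose_dims)
    w = torch.linspace(0, 1, overlap).unsqueeze(-1)
    blend = (1 - w) * sign_a[-overlap:] + w * sign_b[:overlap]
    return torch.cat([sign_a[:-overlap], blend, sign_b[overlap:]])

seq = coarticulate(torch.randn(40, 150), torch.randn(35, 150))  # (67, 150)
```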
arXiv Detail & Related papers (2022-03-29T08:51:38Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D-based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on whole-body keypoints and features.
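Assembling whole-body keypoints into a single input tensor for a skeleton-based recognizer might look like the following; the keypoint counts follow common whole-body layouts but are assumptions here.

```python
import torch

# Stack body, hand, and face keypoints into one per-frame joint set.
body = torch.randn(32, 17, 2)    # (frames, joints, xy) -- assumed counts
hands = torch.randn(32, 42, 2)
face = torch.randn(32, 68, 2)

whole_body = torch.cat([body, hands, face], dim=1)  # (32, 127, 2)
# A GCN or transformer over this sequence can then be ensembled with
# RGB-D models, as the abstract suggests.
```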
arXiv Detail & Related papers (2021-03-16T03:38:17Z)