Towards Automatic Speech to Sign Language Generation
- URL: http://arxiv.org/abs/2106.12790v1
- Date: Thu, 24 Jun 2021 06:44:19 GMT
- Title: Towards Automatic Speech to Sign Language Generation
- Authors: Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri,
C V Jawahar
- Abstract summary: We propose a multi-tasking transformer network trained to generate a signer's poses from speech segments.
Our model learns to generate continuous sign pose sequences in an end-to-end manner.
- Score: 35.22004819666906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to solve the highly challenging task of generating continuous sign
language videos solely from speech segments for the first time. Recent efforts
in this space have focused on generating such videos from human-annotated text
transcripts without considering other modalities. However, replacing speech
with sign language is a practical solution when communicating with people with
hearing loss. Therefore, we eliminate the need to use text as input and design
techniques that work for more natural, continuous,
freely uttered speech covering an extensive vocabulary. Since the current
datasets are inadequate for generating sign language directly from speech, we
collect and release the first Indian sign language dataset comprising
speech-level annotations, text transcripts, and the corresponding sign-language
videos. Next, we propose a multi-tasking transformer network trained to
generate a signer's poses from speech segments. With speech-to-text as an
auxiliary task and an additional cross-modal discriminator, our model learns to
generate continuous sign pose sequences in an end-to-end manner. Extensive
experiments and comparisons with other baselines demonstrate the effectiveness
of our approach. We also conduct additional ablation studies to analyze the
effect of different modules of our network. A demo video containing several
results is included in the supplementary material.
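
The abstract above names the key components: a multi-tasking transformer that maps speech to continuous pose sequences, a speech-to-text auxiliary task, and a cross-modal discriminator. Below is a minimal, hypothetical PyTorch sketch of that kind of setup; the layer sizes, the 137-keypoint pose format, and all module names are illustrative assumptions, not the authors' released architecture.

```python
# Hypothetical sketch of a multi-task speech-to-sign-pose transformer.
# Dimensions, module names, and the 137-keypoint pose layout are assumptions.
import torch
import torch.nn as nn


class SpeechToSignPose(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4,
                 n_keypoints=137, vocab_size=1000):
        super().__init__()
        self.speech_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.pose_in = nn.Linear(2 * n_keypoints, d_model)   # (x, y) per keypoint
        self.pose_out = nn.Linear(d_model, 2 * n_keypoints)  # continuous pose regression
        self.text_head = nn.Linear(d_model, vocab_size)      # auxiliary speech-to-text logits

    def forward(self, mels, prev_poses):
        memory = self.encoder(self.speech_proj(mels))         # (B, T_audio, d_model)
        text_logits = self.text_head(memory)                  # auxiliary recognition branch
        tgt = self.pose_in(prev_poses)                        # (B, T_pose, d_model)
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.pose_out(out), text_logits


class CrossModalDiscriminator(nn.Module):
    """Scores whether a (speech, pose sequence) pair looks real or generated."""
    def __init__(self, n_mels=80, n_keypoints=137, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + 2 * n_keypoints, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1))

    def forward(self, mels, poses):
        # Mean-pool both streams over time before scoring (a simplification).
        return self.net(torch.cat([mels.mean(1), poses.mean(1)], dim=-1))


if __name__ == "__main__":
    model, disc = SpeechToSignPose(), CrossModalDiscriminator()
    mels = torch.randn(2, 200, 80)         # mel-spectrogram segments
    prev = torch.randn(2, 60, 2 * 137)     # teacher-forced previous pose frames
    poses, text_logits = model(mels, prev)
    score = disc(mels, poses)
    print(poses.shape, text_logits.shape, score.shape)
```

A training loop would typically combine a pose regression loss, the auxiliary text loss, and an adversarial loss from the discriminator; the relative weighting of these terms is not given here and would need tuning.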
Related papers
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
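As a rough illustration of the step-by-step generation idea described in this entry, the sketch below shows one training step of a toy sequence diffusion model that denoises sign-pose frames conditioned on a speech or text embedding. The network, pose dimensionality, and noise schedule are assumptions for illustration, not the MS2SL implementation.

```python
# Toy denoising step for a sequence diffusion model over sign-pose frames.
# All shapes and the cosine-style schedule are illustrative assumptions.
import torch
import torch.nn as nn


class PoseDenoiser(nn.Module):
    def __init__(self, pose_dim=274, cond_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, noisy_pose, cond, t):
        # Predict the noise added to each pose frame, conditioned on the
        # speech/text embedding and the diffusion timestep t in [0, 1].
        t = t.expand(noisy_pose.size(0), noisy_pose.size(1), 1)
        cond = cond.unsqueeze(1).expand(-1, noisy_pose.size(1), -1)
        return self.net(torch.cat([noisy_pose, cond, t], dim=-1))


if __name__ == "__main__":
    denoiser = PoseDenoiser()
    poses = torch.randn(2, 60, 274)            # clean pose frames (targets)
    cond = torch.randn(2, 512)                 # embedding from speech or text
    t = torch.rand(1)                          # random diffusion timestep
    alpha = torch.cos(t * torch.pi / 2) ** 2   # toy noise schedule
    noise = torch.randn_like(poses)
    noisy = alpha.sqrt() * poses + (1 - alpha).sqrt() * noise
    loss = ((denoiser(noisy, cond, t) - noise) ** 2).mean()
    print(loss.item())
```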
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
- SignCLIP: Connecting Text and Sign Language by Contrastive Learning [39.72545568965546]
SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.
We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of 500 thousand video clips in up to 44 sign languages.
We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights.
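The contrastive pairing of sign videos and text described in this entry can be illustrated with a standard CLIP-style InfoNCE objective. The sketch below assumes pooled pose and text embeddings of matching dimension; it is a generic illustration, not the authors' released code.

```python
# Illustrative CLIP-style contrastive objective between sign-pose clips and
# spoken-language text. Encoders and dimensions are assumed, not from SignCLIP.
import torch
import torch.nn.functional as F


def contrastive_loss(pose_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (sign-pose, text) embeddings."""
    pose_emb = F.normalize(pose_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pose_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    pose_emb = torch.randn(8, 512)   # e.g. pooled pose-encoder outputs
    text_emb = torch.randn(8, 512)   # e.g. pooled text-encoder outputs
    print(contrastive_loss(pose_emb, text_emb).item())
```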
arXiv Detail & Related papers (2024-07-01T13:17:35Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Topic Detection in Continuous Sign Language Videos [23.43298383445439]
We introduce the novel task of sign language topic detection.
We base our experiments on How2Sign, a large-scale video dataset spanning multiple semantic domains.
arXiv Detail & Related papers (2022-09-01T19:17:35Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
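
A common, generic way to bridge the modality gap mentioned in the last entry is a small adaptation module that shortens the acoustic feature sequence and projects it toward the text-embedding space before a text-trained decoder consumes it. The sketch below is an illustrative assumption of that idea, not the paper's exact adaptation model.

```python
# Generic speech-to-text adapter sketch: shrink the (long) speech feature
# sequence and project it into the text-embedding space. Shapes are assumed.
import torch
import torch.nn as nn


class SpeechToTextAdapter(nn.Module):
    def __init__(self, speech_dim=512, text_dim=512, shrink_factor=4):
        super().__init__()
        # A strided 1-D convolution shortens the acoustic sequence so its
        # length is closer to a token sequence, easing reuse of a text decoder.
        self.shrink = nn.Conv1d(speech_dim, text_dim,
                                kernel_size=shrink_factor, stride=shrink_factor)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, speech_feats):            # (B, T_audio, speech_dim)
        x = self.shrink(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.norm(x)                     # (B, ~T_audio/shrink, text_dim)


if __name__ == "__main__":
    adapter = SpeechToTextAdapter()
    speech_feats = torch.randn(2, 400, 512)     # e.g. speech encoder outputs
    print(adapter(speech_feats).shape)          # torch.Size([2, 100, 512])
```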
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.