A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production
- URL: http://arxiv.org/abs/2407.02854v2
- Date: Wed, 04 Dec 2024 13:41:11 GMT
- Title: A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production
- Authors: Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, Jong C. Park
- Abstract summary: This paper addresses the challenges associated with the use of glosses in Sign Language Translation (SLT) and Sign Language Production (SLP).
We introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language.
Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods.
- Score: 9.065171626657818
- Abstract: This work addresses the challenges associated with the use of glosses in both Sign Language Translation (SLT) and Sign Language Production (SLP). While glosses have long been used as a bridge between sign language and spoken language, they come with two major limitations that impede the advancement of sign language systems. First, annotating the glosses is a labor-intensive and time-consuming process, which limits the scalability of datasets. Second, the glosses oversimplify sign language by stripping away its spatio-temporal dynamics, reducing complex signs to basic labels and missing the subtle movements essential for precise interpretation. To address these limitations, we introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language, providing a more dynamic and detailed alternative to the use of the glosses. The core idea of UniGloR is simple yet effective: We derive dense spatio-temporal representations from sign keypoint sequences using self-supervised learning and seamlessly integrate them into SLT and SLP tasks. Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods on two widely-used datasets: PHOENIX14T and How2Sign.
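To make the core idea concrete, below is a minimal sketch of self-supervised masked reconstruction over keypoint sequences in PyTorch. It illustrates the general technique the abstract describes, not the authors' implementation; the module names, the 67-joint layout, the masking ratio, and all other hyperparameters are assumptions.

```python
# Sketch (not the authors' code): learn dense spatio-temporal representations
# from sign keypoint sequences by masking random frames and reconstructing them.
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Encodes a (batch, frames, joints * 2) keypoint sequence into dense features."""
    def __init__(self, num_joints=67, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(num_joints * 2, d_model)  # flatten (x, y) per joint
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        return self.temporal(self.proj(x))  # (batch, frames, d_model)

class MaskedReconstruction(nn.Module):
    """Pretext task: replace random frames with a mask token, reconstruct them."""
    def __init__(self, encoder, num_joints=67, d_model=256, mask_ratio=0.5):
        super().__init__()
        self.encoder = encoder
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(num_joints * 2))
        self.decoder = nn.Linear(d_model, num_joints * 2)

    def forward(self, x):  # x: (batch, frames, joints * 2)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, x)
        recon = self.decoder(self.encoder(corrupted))
        return ((recon - x) ** 2)[mask].mean()  # MSE on masked frames only

encoder = KeypointEncoder()
pretext = MaskedReconstruction(encoder)
loss = pretext(torch.randn(2, 64, 67 * 2))  # toy batch: 2 clips, 64 frames
loss.backward()
```

After pretraining, `encoder(x)` would supply the dense per-frame representations that stand in for gloss supervision in downstream SLT and SLP models.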
Related papers
- An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z)
- C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval [37.12863427950066]
We introduce an innovative pretraining paradigm for gloss-free Sign Language Representation Learning (SLRL), called C$^2$RL.
C$2$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign.
It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign.
arXiv Detail & Related papers (2024-08-19T12:42:10Z)
- Scaling up Multimodal Pre-training for Sign Language Understanding [96.17753464544604]
Sign language serves as the primary means of communication for the deaf-mute community.
To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied.
These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos.
arXiv Detail & Related papers (2024-08-16T06:04:25Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
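As a rough illustration of the multi-modal integration described above, the sketch below projects per-frame video, keypoint, and optical-flow features to a shared width and fuses them for a unified backbone. The fusion-by-summation design and all dimensions are assumptions, not SignVTCL's actual architecture.

```python
# Sketch (assumptions, not SignVTCL's code): fuse per-frame features from
# three modalities into one stream for a unified visual backbone.
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d_video=1024, d_kpt=256, d_flow=512, d_model=512):
        super().__init__()
        # Project each modality to the same width, then fuse by summation.
        self.p_video = nn.Linear(d_video, d_model)
        self.p_kpt = nn.Linear(d_kpt, d_model)
        self.p_flow = nn.Linear(d_flow, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video, kpt, flow):  # each: (batch, frames, d_*)
        fused = self.p_video(video) + self.p_kpt(kpt) + self.p_flow(flow)
        return self.norm(fused)  # (batch, frames, d_model) for the backbone

fusion = MultiModalFusion()
out = fusion(torch.randn(2, 32, 1024), torch.randn(2, 32, 256), torch.randn(2, 32, 512))
```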
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT framework based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pretext tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
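Stage (i) relies on a CLIP-style contrastive objective. A generic InfoNCE formulation of that objective looks like the sketch below; this is the textbook symmetric form, not necessarily the paper's exact loss.

```python
# Sketch of a CLIP-style contrastive loss between paired video and text
# embeddings (generic InfoNCE, not the paper's exact implementation).
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; row i of each
    input is a matched video-text pair, so targets lie on the diagonal."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (batch, batch) similarities
    targets = torch.arange(len(v), device=v.device)  # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```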
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier.
We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally.
Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
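One simple way to keep attention local to nearby frames is a banded attention mask, sketched below. This fixed-window mask is only an approximation of the idea: the paper's gloss attention keeps attention within locally coherent semantic segments rather than a fixed frame distance.

```python
# Sketch: a banded (local-window) attention mask as a stand-in for
# semantics-local attention. The window size is an illustrative assumption.
import torch

def local_attention_mask(num_frames, window=8):
    """True where attention is allowed: |i - j| <= window."""
    idx = torch.arange(num_frames)
    return (idx[:, None] - idx[None, :]).abs() <= window

mask = local_attention_mask(100, window=8)
# Usage with attention scores: True means "keep", so mask out the rest, e.g.
# scores.masked_fill(~mask, float("-inf")) before the softmax.
```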
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
- Gloss-Free End-to-End Sign Language Translation [59.28829048788345]
We design the Gloss-Free End-to-end sign language translation framework (GloFE).
Our method improves the performance of SLT in the gloss-free setting by exploiting the shared underlying semantics of signs and the corresponding spoken translation.
We obtained state-of-the-art results on large-scale datasets, including OpenASL and How2Sign.
arXiv Detail & Related papers (2023-05-22T09:57:43Z)
- Natural Language-Assisted Sign Language Recognition [28.64871971445024]
We propose the Natural Language-Assisted Sign Language Recognition framework.
It exploits semantic information contained in glosses (sign labels) to mitigate the problem of visually indistinguishable signs (VISigns) in sign languages.
Our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- Changing the Representation: Examining Language Representation for Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline.
We use language models such as BERT and Word2Vec to create better sentence level embeddings.
We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
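For example, a sentence-level embedding can be obtained from BERT by mean-pooling its token states, as in the sketch below; the model choice and pooling strategy are illustrative assumptions, not necessarily the paper's setup.

```python
# Sketch: sentence-level embeddings via mean-pooled BERT token states.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)     # (1, 768) mean-pooled

emb = sentence_embedding("The weather will be cold tomorrow.")
```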
arXiv Detail & Related papers (2022-09-16T12:45:29Z)
- Data Augmentation for Sign Language Gloss Translation [115.13684506803529]
Sign language translation (SLT) is often decomposed into video-to-gloss recognition and gloss-to-text translation.
We focus here on gloss-to-text translation, which we treat as a low-resource neural machine translation (NMT) problem.
We generate pseudo-parallel gloss-text pairs from monolingual spoken-language text with rule-based heuristics. By pre-training on the resulting synthetic data, we improve translation from American Sign Language (ASL) to English and from German Sign Language (DGS) to German by up to 3.14 and 2.20 BLEU, respectively.
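An illustrative heuristic of this kind might lemmatize content words, drop function words, and uppercase the result to resemble gloss notation, as sketched below. These particular rules are our assumption for illustration, not the paper's exact heuristics.

```python
# Sketch: a rule-based heuristic turning spoken-language text into a
# pseudo-gloss sequence, yielding a synthetic gloss-text training pair.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

FUNCTION_POS = {"DET", "ADP", "AUX", "CCONJ", "SCONJ", "PART", "PUNCT"}

def text_to_pseudo_gloss(sentence: str) -> str:
    doc = nlp(sentence)
    glosses = [tok.lemma_.upper() for tok in doc if tok.pos_ not in FUNCTION_POS]
    return " ".join(glosses)

print(text_to_pseudo_gloss("The weather will be cold tomorrow."))
# e.g. -> "WEATHER COLD TOMORROW" (paired with the original sentence)
```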
arXiv Detail & Related papers (2021-05-16T16:37:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.