Related papers: A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

URL: http://arxiv.org/abs/2407.02854v2
Date: Wed, 04 Dec 2024 13:41:11 GMT
Title: A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production
Authors: Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, Jong C. Park,
Abstract summary: This paper addresses the challenges associated with the use of glosses in Sign Language Translation (SLT) and Sign Language Production Language (SLP)<n>We introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture thetemporal inherent sign language.<n>Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches performance of previous SLT and SLP methods.
Score: 9.065171626657818
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work addresses the challenges associated with the use of glosses in both Sign Language Translation (SLT) and Sign Language Production (SLP). While glosses have long been used as a bridge between sign language and spoken language, they come with two major limitations that impede the advancement of sign language systems. First, annotating the glosses is a labor-intensive and time-consuming process, which limits the scalability of datasets. Second, the glosses oversimplify sign language by stripping away its spatio-temporal dynamics, reducing complex signs to basic labels and missing the subtle movements essential for precise interpretation. To address these limitations, we introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language, providing a more dynamic and detailed alternative to the use of the glosses. The core idea of UniGloR is simple yet effective: We derive dense spatio-temporal representations from sign keypoint sequences using self-supervised learning and seamlessly integrate them into SLT and SLP tasks. Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods on two widely-used datasets: PHOENIX14T and How2Sign.

Related papers

GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning [60.86278956347739]
GLOS is a sign language generation framework with temporally aligned gloss-level conditioning.<n>Our method generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
arXiv Detail & Related papers (2025-06-09T06:09:03Z)
Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text.<n>A common approach relies on gloss annotations as an intermediate representation.<n>We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z)
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses. This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language. We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z)
C${^2}$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval [37.12863427950066]
We introduce an innovative pretraining paradigm for gloss-free SLRL, called C$2$RL. C$2$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign.
arXiv Detail & Related papers (2024-08-19T12:42:10Z)
Scaling up Multimodal Pre-training for Sign Language Understanding [96.17753464544604]
Sign language serves as the primary meaning of communication for the deaf-mute community. To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied. These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos.
arXiv Detail & Related papers (2024-08-16T06:04:25Z)
Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation [30.008980708977095]
We introduce Sign2GPT, a novel framework for sign language translation. We propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses. We evaluate our approach on two public benchmark sign language translation datasets.
arXiv Detail & Related papers (2024-05-07T10:00:38Z)
SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning. It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone. It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature. We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-) Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier. We then propose emphgloss attention, which enables the model to keep its attention within video segments that have the same semantics locally. Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
Gloss-Free End-to-End Sign Language Translation [59.28829048788345]
We design the Gloss-Free End-to-end sign language translation framework (GloFE) Our method improves the performance of SLT in the gloss-free setting by exploiting the shared underlying semantics of signs and the corresponding spoken translation. We obtained state-of-the-art results on large-scale datasets, including OpenASL and How2Sign.
arXiv Detail & Related papers (2023-05-22T09:57:43Z)
Natural Language-Assisted Sign Language Recognition [28.64871971445024]
We propose the Natural Language-Assisted Sign Language Recognition framework. It exploits semantic information contained in glosses (sign labels) to mitigate the problem of visually indistinguishable signs (VISigns) in sign languages. Our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks. We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking. We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z)
Changing the Representation: Examining Language Representation for Neural Sign Language Production [43.45785951443149]
We apply Natural Language Processing techniques to the first step of the Neural Sign Language Production pipeline. We use language models such as BERT and Word2Vec to create better sentence level embeddings. We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation.
arXiv Detail & Related papers (2022-09-16T12:45:29Z)
All You Need In Sign Language Production [50.3955314892191]
Sign language recognition and production need to cope with some critical challenges. We present an introduction to the Deaf culture, Deaf centers, psychological perspective of sign language. Also, the backbone architectures and methods in SLP are briefly introduced and the proposed taxonomy on SLP is presented.
arXiv Detail & Related papers (2022-01-05T13:45:09Z)
Data Augmentation for Sign Language Gloss Translation [115.13684506803529]
Sign language translation (SLT) is often decomposed into video-to-gloss recognition and gloss-totext translation. We focus here on gloss-to-text translation, which we treat as a low-resource neural machine translation (NMT) problem. By pre-training on the thus obtained synthetic data, we improve translation from American Sign Language (ASL) to English and German Sign Language (DGS) to German by up to 3.14 and 2.20 BLEU, respectively.
arXiv Detail & Related papers (2021-05-16T16:37:36Z)
Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals. This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)
Adversarial Training for Multi-Channel Sign Language Production [43.45785951443149]
We propose an Adversarial Multi-Channel approach to Sign Language Production. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output.
arXiv Detail & Related papers (2020-08-27T23:05:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.