GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning
- URL: http://arxiv.org/abs/2506.07460v1
- Date: Mon, 09 Jun 2025 06:09:03 GMT
- Title: GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning
- Authors: Taeryung Lee, Hyeongjin Nam, Gyeongsik Moon, Kyoung Mu Lee
- Abstract summary: GLOS is a sign language generation framework with temporally aligned gloss-level conditioning. Our method generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
- Score: 60.86278956347739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress in SLG, existing methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level conditioning, which encodes the entire sentence of the input text into a single feature vector as a condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This enables the model to access both the temporal structure of sign language and word-level semantics at each timestep. As a result, the model gains fine-grained control over signs and better preserves lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantics and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, which is composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
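The contrast the abstract draws — a single sentence-level vector broadcast to every timestep versus per-gloss embeddings aligned to their motion spans — can be illustrated with a minimal sketch. This is a hypothetical toy, not the GLOS implementation: the names (`gloss_emb`, `durations`) and the additive fusion standing in for the TAC module are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                # feature dimension (toy value)
glosses = ["HELLO", "HOW", "YOU"]    # toy gloss sequence
durations = [4, 3, 5]                # motion frames assigned to each gloss
T = sum(durations)                   # total motion timesteps

# Stand-in for a learned gloss embedding table.
gloss_emb = {g: rng.standard_normal(D) for g in glosses}

# Sentence-level conditioning (what the abstract argues against):
# the whole sentence is collapsed into one vector and broadcast,
# so every timestep sees the same condition.
sentence_cond = np.mean([gloss_emb[g] for g in glosses], axis=0)
sentence_cond_seq = np.tile(sentence_cond, (T, 1))        # (T, D)

# Gloss-level conditioning: each timestep receives the embedding of
# the gloss active at that moment, preserving lexical order.
gloss_cond_seq = np.concatenate(
    [np.tile(gloss_emb[g], (d, 1)) for g, d in zip(glosses, durations)]
)                                                          # (T, D)

# Trivial additive fusion as a stand-in for the TAC module: the
# aligned condition is delivered to its matching motion timestep.
motion_feats = rng.standard_normal((T, D))
fused = motion_feats + gloss_cond_seq

print(gloss_cond_seq.shape)   # (12, 8)
```

Note how `sentence_cond_seq` is identical at every timestep, while `gloss_cond_seq` changes as the active gloss changes — the property the paper credits for correct lexical ordering.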
Related papers
- AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition [0.0]
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between hearing and hearing-impaired communities. We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
arXiv Detail & Related papers (2025-07-26T07:28:33Z) - Sign Spotting Disambiguation using Large Language Models [29.79050316749927]
We introduce a training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining.
arXiv Detail & Related papers (2025-07-04T16:38:09Z) - StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation [33.695308849489784]
We propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs. Specifically, we train an encoder-decoder architecture to learn a structure-aware representation of the spatial-temporal skeleton. We design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features.
arXiv Detail & Related papers (2025-06-16T07:09:51Z) - Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation. We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production [9.065171626657818]
This paper addresses the challenges associated with the use of glosses in Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language. Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods.
arXiv Detail & Related papers (2024-07-03T07:12:36Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT framework based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition.
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.