Linguistically Motivated Sign Language Segmentation
- URL: http://arxiv.org/abs/2310.13960v2
- Date: Mon, 30 Oct 2023 13:51:39 GMT
- Title: Linguistically Motivated Sign Language Segmentation
- Authors: Amit Moryossef, Zifan Jiang, Mathias Müller, Sarah Ebling, Yoav Goldberg
- Abstract summary: We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
- Score: 51.06873383204105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language segmentation is a crucial task in sign language processing
systems. It enables downstream tasks such as sign recognition, transcription,
and machine translation. In this work, we consider two kinds of segmentation:
segmentation into individual signs and segmentation into phrases, larger units
comprising several signs. We propose a novel approach to jointly model these
two tasks.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for
continuous signing. Given that prosody plays a significant role in phrase
boundaries, we explore the use of optical flow features. We also provide an
extensive analysis of hand shapes and 3D hand normalization.
We find that introducing BIO tagging is necessary to model sign boundaries.
Explicitly encoding prosody by optical flow improves segmentation in shallow
models, but its contribution is negligible in deeper models. Careful tuning of
the decoding algorithm atop the models further improves the segmentation
quality.
We demonstrate that our final models generalize to out-of-domain video
content in a different signed language, even under a zero-shot setting. We
observe that including optical flow and 3D hand normalization enhances the
robustness of the model in this context.
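To make the tagging and decoding choices concrete, here is a minimal sketch, assuming a Python implementation: it encodes annotated sign spans as frame-level BIO tags (so that two signs articulated back to back still produce a detectable boundary, which plain IO tagging cannot express) and greedily decodes per-frame B/I/O probabilities into segments with tunable thresholds, the kind of knobs the abstract refers to as careful tuning of the decoding algorithm. The function names, tag ids, and threshold values are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: frame-level BIO tagging and a simple
# threshold-based decoder for sign segmentation.
from typing import List, Tuple

B, I, O = 0, 1, 2  # tag ids: Begin, Inside, Outside (assumed encoding)


def spans_to_bio(num_frames: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Convert annotated sign spans [(start, end), ...] (end exclusive)
    into per-frame BIO tags. Unlike IO tagging, the B tag marks where a
    new sign starts, so two signs produced back to back with no pause
    still yield a visible boundary."""
    tags = [O] * num_frames
    for start, end in spans:
        tags[start] = B
        for t in range(start + 1, end):
            tags[t] = I
    return tags


def decode_bio(probs: List[Tuple[float, float, float]],
               b_threshold: float = 0.5,
               o_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Greedily turn per-frame (p_B, p_I, p_O) probabilities into
    (start, end) segments. The two thresholds are decoding knobs of the
    kind the abstract alludes to; the default values here are arbitrary."""
    segments, start = [], None
    for t, (p_b, _, p_o) in enumerate(probs):
        if p_b >= b_threshold:                          # a new segment begins
            if start is not None:
                segments.append((start, t))
            start = t
        elif start is not None and p_o >= o_threshold:  # current segment ends
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(probs)))
    return segments


if __name__ == "__main__":
    # Two signs articulated contiguously (frames 2-4 and 5-7); the rest is a pause.
    print(spans_to_bio(10, [(2, 5), (5, 8)]))
    # -> [2, 2, 0, 1, 1, 0, 1, 1, 2, 2]; IO tagging would hide the boundary at frame 5.
```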
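Along the same lines, a minimal sketch of a prosody-style motion cue: instead of dense optical flow over pixels, it approximates articulation speed by the frame-to-frame displacement of estimated pose keypoints, so that lows in the signal suggest holds or pauses near phrase boundaries. The array shapes, frame rate, and keypoint count are assumptions for illustration, not values taken from the paper.

```python
# Illustrative sketch only: a crude prosody/motion feature computed from
# 2D pose keypoints rather than dense image optical flow.
import numpy as np


def motion_feature(poses: np.ndarray, fps: float = 25.0) -> np.ndarray:
    """poses: array of shape (frames, keypoints, 2) holding x/y coordinates.
    Returns one scalar per frame: the mean keypoint displacement between
    consecutive frames, scaled by the frame rate. Low values indicate a
    hold or pause, which tends to coincide with phrase boundaries."""
    deltas = np.diff(poses, axis=0)                     # (frames-1, keypoints, 2)
    speed = np.linalg.norm(deltas, axis=-1).mean(axis=-1) * fps
    return np.concatenate([[0.0], speed])               # pad so length == frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_poses = rng.normal(size=(100, 75, 2))          # 75 keypoints is an assumption
    print(motion_feature(fake_poses).shape)             # (100,)
```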
Related papers
- SignAttention: On the Interpretability of Transformer Models for Sign Language Translation [2.079808290618441]
This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation model.
We examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses.
This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems.
arXiv Detail & Related papers (2024-10-18T14:38:37Z)
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale [22.49602248323602]
A persistent challenge in sign language video processing is how we learn representations of sign language.
Our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer.
Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training.
arXiv Detail & Related papers (2024-06-11T03:00:41Z)
- SignBLEU: Automatic Evaluation of Multi-channel Sign Language Translation [3.9711029428461653]
We introduce a new task named multi-channel sign language translation (MCSLT).
We present a novel metric, SignBLEU, designed to capture multiple signal channels.
We found that SignBLEU consistently correlates better with human judgment than competing metrics.
arXiv Detail & Related papers (2024-06-10T05:01:26Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z)
- TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles, which provide additional weak supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z)