SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences
- URL: http://arxiv.org/abs/2405.02977v1
- Date: Sun, 5 May 2024 15:50:02 GMT
- Title: SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences
- Authors: Ali Emre Keskin, Hacer Yalim Keles
- Abstract summary: We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset.
We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements.
The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is an expensive and challenging task due to the costs associated with gathering a varied group of signers. Motivated by these challenges, we aimed to develop a solution that addresses these limitations. In this context, we focused on textually describing body movements from skeleton keypoint sequences, leading to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements. This model processes the skeleton keypoints data as a vector, applies a fully connected layer for embedding, and utilizes a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, namely the AUTSL-SkelCap, will be made publicly available soon.
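The pipeline described in the abstract can be sketched in a few lines of PyTorch. This is a hypothetical reconstruction based only on the abstract; the keypoint count, model dimensions, vocabulary size, and tokenization are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn

class SkelCapSketch(nn.Module):
    """Hedged sketch of the described pipeline: per-frame keypoints are flattened into a
    vector, embedded with a fully connected layer, and fed to a transformer
    sequence-to-sequence model that decodes a textual description token by token."""
    def __init__(self, num_keypoints=67, d_model=256, vocab_size=8000):
        super().__init__()
        self.frame_embed = nn.Linear(num_keypoints * 2, d_model)   # (x, y) per keypoint, flattened
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, keypoints, target_tokens):
        # keypoints: (batch, frames, num_keypoints * 2); target_tokens: (batch, seq_len)
        src = self.frame_embed(keypoints)
        tgt = self.token_embed(target_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(target_tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.lm_head(out)   # logits over caption tokens, trained with cross-entropy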
Related papers
- Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition [0.20075899678041528]
We introduce a large-scale isolated ISL dataset and a novel SL recognition model based on skeleton graph structure.
The dataset covers 2002 common words used daily in the deaf community, recorded by 20 deaf adult signers (10 male and 10 female).
We propose an SL recognition model, the Hierarchical Windowed Graph Attention Network (HWGAT), which utilizes the human upper-body skeleton graph.
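The attention-over-a-skeleton-graph idea can be illustrated with a generic graph-attention layer in which each joint attends only to itself and its skeletal neighbors; this is a simplified sketch, not the authors' hierarchical windowed design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonGraphAttention(nn.Module):
    """Single-head graph attention over joints, masked by the skeleton adjacency matrix."""
    def __init__(self, in_dim=3, out_dim=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.scale = out_dim ** -0.5

    def forward(self, joints, adjacency):
        # joints: (batch, num_joints, in_dim); adjacency: (num_joints, num_joints), 1 on edges
        x = self.proj(joints)
        scores = torch.matmul(x, x.transpose(-2, -1)) * self.scale
        mask = adjacency + torch.eye(adjacency.size(0), device=adjacency.device)
        scores = scores.masked_fill(mask == 0, float('-inf'))
        return torch.matmul(F.softmax(scores, dim=-1), x)   # neighbor-weighted joint features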
arXiv Detail & Related papers (2024-07-19T11:48:36Z) - Zero-Shot Text Classification via Self-Supervised Tuning [46.9902502503747]
We propose a new paradigm based on self-supervised learning to solve zero-shot text classification tasks.
It tunes language models with unlabeled data, a process called self-supervised tuning.
Our model outperforms the state-of-the-art baselines on 7 out of 10 tasks.
arXiv Detail & Related papers (2023-05-19T05:47:33Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign
Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose SignBERT+, the first self-supervised pre-trainable framework, which incorporates a model-aware hand prior.
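The self-supervised pre-training idea can be illustrated with a mask-and-reconstruct step over hand keypoints; this is a simplified stand-in, and the actual SignBERT+ objective additionally exploits a hand-model-aware prior.

import torch

def masked_keypoint_pretraining_step(encoder, hand_keypoints, mask_ratio=0.3):
    """Randomly hide a fraction of hand joints and train the encoder to reconstruct them
    from the visible ones (BERT-style masked modeling on keypoints)."""
    # hand_keypoints: (batch, frames, num_hand_joints, 2)
    mask = torch.rand(hand_keypoints.shape[:3], device=hand_keypoints.device) < mask_ratio
    corrupted = hand_keypoints.clone()
    corrupted[mask] = 0.0                      # zero out the masked joints
    reconstructed = encoder(corrupted)         # encoder outputs the same shape as its input
    return ((reconstructed - hand_keypoints)[mask] ** 2).mean()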
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - Isolated Sign Language Recognition based on Tree Structure Skeleton
Images [2.179313476241343]
We use the Tree Structure Skeleton Image (TSSI) as an alternative input to improve the accuracy of skeleton-based models for sign recognition.
We trained a DenseNet-121 using this type of input and compared it with other skeleton-based deep learning methods.
Our model (SL-TSSI-DenseNet) outperforms the state of the art among other skeleton-based models.
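The core trick, laying out a skeleton sequence as an image so that a 2D CNN such as DenseNet-121 can consume it, can be sketched as follows; the traversal order and normalization here are illustrative assumptions rather than the exact TSSI construction.

import numpy as np

def skeleton_sequence_to_image(keypoints, joint_order):
    """Rows follow a tree-traversal order of the joints, columns are frames, and the two
    channels hold x and y coordinates scaled to [0, 1] like pixel intensities."""
    # keypoints: (frames, num_joints, 2); joint_order: joint indices from a depth-first traversal
    reordered = keypoints[:, joint_order, :]
    image = np.transpose(reordered, (1, 0, 2))          # (joints, frames, 2)
    mins = image.min(axis=(0, 1), keepdims=True)
    maxs = image.max(axis=(0, 1), keepdims=True)
    return (image - mins) / (maxs - mins + 1e-8)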
arXiv Detail & Related papers (2023-04-10T01:58:50Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural
Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Skeletal Graph Self-Attention: Embedding a Skeleton Inductive Bias into
Sign Language Production [37.679114155300084]
Recent approaches to Sign Language Production (SLP) have adopted spoken language Neural Machine Translation (NMT) architectures, applied without sign-specific modifications.
In this paper, we represent sign language sequences as a skeletal graph structure, with joints as nodes and both spatial and temporal connections as edges.
We propose Skeletal Graph Self-Attention (SGSA), a novel graphical attention layer that embeds a skeleton bias into the SLP model.
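One way to embed such a skeleton bias, shown here only as a generic sketch rather than the exact SGSA layer, is to add an adjacency-derived term to the attention logits so that connected joints attend to each other more strongly.

import torch

def skeleton_biased_attention(queries, keys, values, adjacency_bias):
    """Scaled dot-product attention whose logits are shifted by a skeleton-derived bias."""
    # queries, keys, values: (batch, num_joints, dim); adjacency_bias: (num_joints, num_joints)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / queries.size(-1) ** 0.5
    weights = torch.softmax(scores + adjacency_bias, dim=-1)
    return torch.matmul(weights, values)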
arXiv Detail & Related papers (2021-12-06T10:12:11Z) - Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model, along with a C3D model, is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the language embeddings of the class labels.
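The zero-shot step can be sketched as projecting visual features into the label-embedding space and picking the nearest class embedding; the projection layer and the source of the label embeddings (e.g. word vectors of the class names) are illustrative assumptions.

import torch
import torch.nn.functional as F

def zero_shot_classify(visual_features, projection, label_embeddings):
    """Map visual features into the language-embedding space and return the index of the
    most similar class embedding; unseen classes only need a label embedding."""
    # visual_features: (batch, feat_dim); projection: an nn.Linear(feat_dim, embed_dim)
    # label_embeddings: (num_classes, embed_dim)
    projected = F.normalize(projection(visual_features), dim=-1)
    labels = F.normalize(label_embeddings, dim=-1)
    return (projected @ labels.T).argmax(dim=-1)       # cosine-similarity nearest label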
arXiv Detail & Related papers (2021-09-02T09:10:39Z) - AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and
Baseline Methods [6.320141734801679]
We present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark.
Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples.
We trained several deep learning based models and provide empirical evaluations using the benchmark.
arXiv Detail & Related papers (2020-08-03T15:12:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.