ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos
- URL: http://arxiv.org/abs/2108.10059v1
- Date: Mon, 23 Aug 2021 10:48:18 GMT
- Title: ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos
- Authors: Razieh Rastgoo, Kourosh Kiani, Sergio Escalera
- Abstract summary: We formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model from two input modalities: RGB and Depth videos.
To benefit from vision Transformer capabilities, we use two vision Transformer models: one for human detection and one for visual feature representation.
A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network.
- Score: 49.337912335944026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sign Language Recognition (SLR) is a challenging research area in computer
vision. To tackle the annotation bottleneck in SLR, we formulate the problem of
Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model
from two input modalities: RGB and Depth videos. To benefit from vision
Transformer capabilities, we use two vision Transformer models: one for human
detection and one for visual feature representation. We configure a transformer
encoder-decoder architecture as a fast and accurate human detection model to
overcome the shortcomings of current human detection models. Based on the
human keypoints, the detected human body is segmented into nine parts. A
spatio-temporal representation of the human body is obtained using a vision
Transformer and an LSTM network. A semantic space maps the visual features to
the lingual embedding of the class labels via a Bidirectional Encoder
Representations from Transformers (BERT) model. We evaluated the proposed model
on four datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, and NTU-60,
obtaining state-of-the-art results compared to prior ZS-SLR models.
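To make the zero-shot matching step concrete, below is a minimal PyTorch sketch of the core pipeline described in the abstract: per-frame features from a vision Transformer are aggregated by an LSTM, projected into the BERT embedding space of the class labels, and an unseen sign is classified by its nearest label embedding. This is not the authors' released code; it covers only the RGB temporal/semantic core and omits the transformer-based human detection, the nine-part body segmentation, and the depth stream. The backbone names (`google/vit-base-patch16-224-in21k`, `bert-base-uncased`), the linear projection head, and cosine-similarity matching are illustrative assumptions.

```python
# Minimal sketch of the ZS-SLR core (illustrative, not the authors' code):
# ViT per-frame features -> LSTM temporal encoding -> projection into the
# BERT embedding space of the class labels -> nearest-label matching.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, BertModel, BertTokenizer

class ZeroShotSLR(nn.Module):
    def __init__(self, vit_name="google/vit-base-patch16-224-in21k",
                 bert_name="bert-base-uncased", hidden=512):
        super().__init__()
        self.vit = ViTModel.from_pretrained(vit_name)      # assumed frame encoder
        self.lstm = nn.LSTM(self.vit.config.hidden_size, hidden, batch_first=True)
        self.bert = BertModel.from_pretrained(bert_name)   # label-text encoder
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        # Projection from the temporal representation into BERT's embedding space.
        self.proj = nn.Linear(hidden, self.bert.config.hidden_size)

    def encode_video(self, frames):                        # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(pixel_values=frames.flatten(0, 1)).pooler_output
        _, (h, _) = self.lstm(feats.view(b, t, -1))        # temporal aggregation
        return self.proj(h[-1])                            # (B, bert_hidden)

    @torch.no_grad()
    def encode_labels(self, labels):                       # list of class-name strings
        toks = self.tokenizer(labels, padding=True, return_tensors="pt")
        return self.bert(**toks).last_hidden_state[:, 0]   # [CLS] embeddings

    def classify(self, frames, labels):
        v = F.normalize(self.encode_video(frames), dim=-1)
        l = F.normalize(self.encode_labels(labels), dim=-1)
        return (v @ l.T).argmax(dim=-1)                    # nearest label by cosine similarity

# Usage: because classes are matched in the BERT label space, the label set
# may contain signs never seen during training (the zero-shot setting).
model = ZeroShotSLR()
video = torch.randn(1, 16, 3, 224, 224)                   # one 16-frame RGB clip
pred = model.classify(video, ["hello", "thank you", "goodbye"])
```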
Related papers
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - Comparative study of Transformer and LSTM Network with attention
mechanism on Image Captioning [0.0]
This study compares the Transformer and the LSTM-with-attention-block model on the MS-COCO dataset.
Both models are discussed together with their state-of-the-art accuracy.
arXiv Detail & Related papers (2023-03-05T11:45:53Z) - Two-Stream Network for Sign Language Recognition and Translation [38.43767031555092]
We introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences.
The resulting model is called TwoStream-SLR, which is competent for sign language recognition.
TwoStream-SLR is extended to a sign language translation model, TwoStream-SLT, by simply attaching an extra translation network (a minimal sketch of this dual-stream design appears after this list).
arXiv Detail & Related papers (2022-11-02T17:59:58Z) - An Empirical Study Of Self-supervised Learning Approaches For Object
Detection With Transformers [0.0]
We explore self-supervised methods based on image reconstruction, masked image modeling and jigsaw.
Preliminary experiments on the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings.
arXiv Detail & Related papers (2022-05-11T14:39:27Z) - Self-supervised Vision Transformers for Joint SAR-optical Representation
Learning [19.316112344900638]
Self-supervised learning (SSL) has attracted much interest in remote sensing and earth observation.
We explore the potential of vision transformers (ViTs) for joint SAR-optical representation learning.
Based on DINO, a state-of-the-art SSL algorithm, we combine SAR and optical imagery by concatenating all channels into a unified input.
arXiv Detail & Related papers (2022-04-11T19:42:53Z) - Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z) - Multi-Scale Vision Longformer: A New Vision Transformer for
High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z) - DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.