Multi-Modal Zero-Shot Sign Language Recognition
- URL: http://arxiv.org/abs/2109.00796v1
- Date: Thu, 2 Sep 2021 09:10:39 GMT
- Title: Multi-Modal Zero-Shot Sign Language Recognition
- Authors: Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, Mohammad Sabokrou
- Abstract summary: We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model and a C3D model are used for hand detection and deep feature extraction, respectively.
A semantic space is used to map the visual features to the lingual embeddings of the class labels.
- Score: 51.07720650677784
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Zero-Shot Learning (ZSL) has rapidly advanced in recent years. To
overcome the annotation bottleneck in Sign Language Recognition (SLR), we
explore Zero-Shot Sign Language Recognition (ZS-SLR), which requires no
annotated visual examples of the unseen sign classes and instead leverages
their textual descriptions. To this end, we propose a multi-modal ZS-SLR model
that harnesses the complementary capabilities of deep features fused with
skeleton-based ones. A Transformer-based model and a C3D model are used for
hand detection and deep feature extraction, respectively. To balance the
dimensionality of the skeleton-based and deep features, we use an Auto-Encoder
(AE) on top of a Long Short-Term Memory (LSTM) network. Finally, a semantic
space maps the fused visual features to the lingual embeddings of the class
labels, obtained via the Bidirectional Encoder Representations from
Transformers (BERT) model. Results on four large-scale datasets,
RKS-PERSIANSIGN, First-Person, ASLVID, and isoGD, show the superiority of the
proposed model over state-of-the-art alternatives in ZS-SLR.
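The pipeline described in the abstract can be summarized in a short PyTorch-style sketch. This is an illustrative reading under stated assumptions, not the authors' implementation: hand crops are assumed to come from an off-the-shelf Transformer-based detector, `deep_feats` stands in for pooled C3D features of those crops, the AE bottleneck sits on top of an LSTM over hand keypoints, and classification is a cosine nearest-neighbour search against BERT embeddings of the class labels. All module names and dimensions are hypothetical.

```python
# Minimal PyTorch-style sketch of the described pipeline. All module names,
# dimensions, and fusion details are illustrative assumptions, not the
# authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonEncoder(nn.Module):
    """LSTM over hand-keypoint sequences, with an auto-encoder bottleneck that
    brings the skeleton features to a size comparable with the deep features."""

    def __init__(self, keypoint_dim=42, hidden_dim=256, bottleneck_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(keypoint_dim, hidden_dim, batch_first=True)
        self.encoder = nn.Linear(hidden_dim, bottleneck_dim)  # AE encoder
        self.decoder = nn.Linear(bottleneck_dim, hidden_dim)  # AE decoder (reconstruction loss omitted here)

    def forward(self, keypoints):              # keypoints: (B, T, keypoint_dim)
        _, (h, _) = self.lstm(keypoints)
        return self.encoder(h[-1])             # (B, bottleneck_dim)


class ZeroShotSLR(nn.Module):
    """Fuses pooled C3D features of detected hand crops with skeleton features,
    then projects the result into the BERT embedding space of the class labels."""

    def __init__(self, deep_dim=4096, bottleneck_dim=512, bert_dim=768):
        super().__init__()
        self.skeleton_encoder = SkeletonEncoder(bottleneck_dim=bottleneck_dim)
        self.to_semantic = nn.Sequential(
            nn.Linear(deep_dim + bottleneck_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, bert_dim),          # visual -> lingual (BERT) space
        )

    def forward(self, deep_feats, keypoints, label_embeddings):
        # deep_feats:       (B, deep_dim) pooled C3D features of detected hand regions
        # keypoints:        (B, T, 42) hand-keypoint sequences
        # label_embeddings: (C, bert_dim) BERT embeddings of all class labels
        fused = torch.cat([deep_feats, self.skeleton_encoder(keypoints)], dim=-1)
        visual_sem = F.normalize(self.to_semantic(fused), dim=-1)
        label_sem = F.normalize(label_embeddings, dim=-1)
        return visual_sem @ label_sem.T         # cosine similarity to every class label


if __name__ == "__main__":
    model = ZeroShotSLR()
    deep_feats = torch.randn(2, 4096)           # stand-in for C3D hand features
    keypoints = torch.randn(2, 16, 42)          # 16 frames, 21 (x, y) hand keypoints
    labels = torch.randn(10, 768)               # stand-in for BERT label embeddings
    print(model(deep_feats, keypoints, labels).shape)   # torch.Size([2, 10])
```

At test time, the predicted sign is the unseen class label whose BERT embedding has the highest cosine similarity to the projected visual feature.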
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
A Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The model is trained on isolated sign videos, in which hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance by significant margins (a minimal late-fusion sketch of this ensembling idea appears after this list).
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos [49.337912335944026]
We formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model from two input modalities: RGB and Depth videos.
To benefit from vision Transformer capabilities, we use two vision Transformer models for human detection and visual feature representation.
A temporal representation of the human body is obtained using a vision Transformer and an LSTM network.
arXiv Detail & Related papers (2021-08-23T10:48:18Z)
- Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning [11.66422653137002]
We propose an attention-based model in the problem setting of Zero-Shot Learning to learn attributes useful for unseen class recognition.
Our method uses an attention mechanism adapted from Vision Transformer to capture and learn discriminative attributes by splitting images into small patches.
arXiv Detail & Related papers (2021-07-30T19:08:44Z)
- PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution [0.42970700836450487]
We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
arXiv Detail & Related papers (2021-07-27T05:01:27Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and enriches WSLR models by transferring knowledge of subtitled news sign language to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
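The skeleton-aware multi-model ensemble entry above (SAM-SLR-v2) motivates the following hedged sketch of the general late-fusion idea: per-modality classifiers are run independently and their softmax scores are combined with learnable weights. The modality names, stand-in backbones, and class count are illustrative assumptions, not the published framework.

```python
# Minimal sketch of skeleton-aware late fusion for isolated SLR. The modality
# names, stand-in backbones, and class count are assumptions, not the published
# SAM-SLR-v2 framework.
import torch
import torch.nn as nn


class LateFusionEnsemble(nn.Module):
    """Runs independent per-modality classifiers (e.g. RGB clips, skeleton
    joints, depth) and combines their class scores with learnable weights."""

    def __init__(self, modality_models: dict):
        super().__init__()
        self.models = nn.ModuleDict(modality_models)
        # one scalar weight per modality, typically tuned on a validation split
        self.weights = nn.Parameter(torch.ones(len(modality_models)))

    def forward(self, inputs: dict):
        # inputs maps modality name -> tensor shaped for that modality's model
        scores = [torch.softmax(m(inputs[name]), dim=-1) for name, m in self.models.items()]
        stacked = torch.stack(scores, dim=0)                 # (M, B, num_classes)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1)
        return (w * stacked).sum(dim=0)                      # fused class probabilities


if __name__ == "__main__":
    num_classes = 226                                        # illustrative label-set size
    ensemble = LateFusionEnsemble({
        "rgb": nn.Linear(512, num_classes),                  # stand-ins for full backbones
        "skeleton": nn.Linear(128, num_classes),
    })
    probs = ensemble({"rgb": torch.randn(4, 512), "skeleton": torch.randn(4, 128)})
    print(probs.shape)                                       # torch.Size([4, 226])
```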
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.