Pose-based Sign Language Recognition using GCN and BERT
- URL: http://arxiv.org/abs/2012.00781v1
- Date: Tue, 1 Dec 2020 19:10:50 GMT
- Title: Pose-based Sign Language Recognition using GCN and BERT
- Authors: Anirudh Tunga, Sai Vidyaranya Nuthalapati, Juan Wachs
- Abstract summary: Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language recognition (SLR) plays a crucial role in bridging the
communication gap between the hearing and vocally impaired community and the
rest of society. Word-level sign language recognition (WSLR) is the first
important step towards understanding and interpreting sign language. However,
recognizing signs from videos is a challenging task as the meaning of a word
depends on a combination of subtle body motions, hand configurations, and other
movements. Recent pose-based architectures for WSLR either model both the
spatial and temporal dependencies among the poses in different frames
simultaneously or only model the temporal information without fully utilizing
the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which
captures spatial and temporal information separately and performs late fusion.
Our proposed architecture explicitly captures the spatial interactions in the
video using a Graph Convolutional Network (GCN). The temporal dependencies
between the frames are captured using Bidirectional Encoder Representations
from Transformers (BERT). Experimental results on WLASL, a standard word-level
sign language recognition dataset, show that our model significantly outperforms
state-of-the-art pose-based methods, improving prediction accuracy by up to 5%.
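The core design in the abstract, running a spatial stream (a GCN over pose graphs) and a temporal stream (BERT over the frame sequence) independently and then late-fusing their class predictions, can be sketched in plain Python. The two stream functions below are hypothetical stand-ins (a real implementation would use trained GCN and BERT models); only the late-fusion step reflects the described architecture.

```python
import math

NUM_CLASSES = 4  # toy gloss vocabulary; WLASL subsets range from 100 to 2000 glosses


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_stream(pose_frames):
    """Stand-in for the GCN branch: class scores from per-frame joint positions."""
    logits = [0.0] * NUM_CLASSES
    for frame in pose_frames:
        for i, (x, y) in enumerate(frame):
            logits[i % NUM_CLASSES] += x + y  # hypothetical scoring rule
    return softmax(logits)


def temporal_stream(pose_frames):
    """Stand-in for the BERT branch: class scores from frame-to-frame motion."""
    logits = [0.0] * NUM_CLASSES
    for prev, cur in zip(pose_frames, pose_frames[1:]):
        for i, ((x0, y0), (x1, y1)) in enumerate(zip(prev, cur)):
            logits[i % NUM_CLASSES] += abs(x1 - x0) + abs(y1 - y0)
    return softmax(logits)


def late_fusion(p_spatial, p_temporal, alpha=0.5):
    """Late fusion: convex combination of the two streams' class probabilities."""
    return [alpha * s + (1 - alpha) * t for s, t in zip(p_spatial, p_temporal)]


# Toy clip: 3 frames, 2 joints each, as (x, y) coordinates.
clip = [[(0.1, 0.2), (0.3, 0.1)],
        [(0.2, 0.3), (0.3, 0.2)],
        [(0.4, 0.3), (0.2, 0.4)]]

fused = late_fusion(spatial_stream(clip), temporal_stream(clip))
pred = max(range(NUM_CLASSES), key=lambda c: fused[c])
```

Because each stream outputs a full probability distribution before fusion, the streams can be trained and tuned independently, which is the usual motivation for late fusion over joint spatio-temporal modeling.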
Related papers
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation [2.6311088262657907]
This work proposes an Isolated Sign Language Recognition (ISLR) approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images.
We show that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS).
In addition to being more accurate, our method is more time-efficient and easier to train, due to its reliance on a simpler network architecture and solely RGB data as input.
arXiv Detail & Related papers (2024-04-29T23:21:17Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production [43.45785951443149]
Sign languages are visual languages, with vocabularies as rich as their spoken language counterparts.
Current deep-learning based Sign Language Production (SLP) models produce under-articulated skeleton pose sequences.
We tackle large-scale SLP by learning to co-articulate between dictionary signs.
We also propose SignGAN, a pose-conditioned human synthesis model that produces photo-realistic sign language videos.
arXiv Detail & Related papers (2022-03-29T08:51:38Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.