Two-Stream Network for Sign Language Recognition and Translation
- URL: http://arxiv.org/abs/2211.01367v2
- Date: Thu, 23 Mar 2023 02:49:35 GMT
- Title: Two-Stream Network for Sign Language Recognition and Translation
- Authors: Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, Brian Mak
- Abstract summary: We introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences.
The resulting model is called TwoStream-SLR, which is competent for sign language recognition.
TwoStream-SLR is extended to a sign language translation model, TwoStream-SLT, by simply attaching an extra translation network.
- Score: 38.43767031555092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign languages are visual languages using manual articulations and non-manual
elements to convey information. For sign language recognition and translation,
the majority of existing approaches directly encode RGB videos into hidden
representations. RGB videos, however, are raw signals with substantial visual
redundancy, leading the encoder to overlook the key information for sign
language understanding. To mitigate this problem and better incorporate domain
knowledge, such as handshape and body movement, we introduce a dual visual
encoder containing two separate streams to model both the raw videos and the
keypoint sequences generated by an off-the-shelf keypoint estimator. To make
the two streams interact with each other, we explore a variety of techniques,
including bidirectional lateral connection, sign pyramid network with auxiliary
supervision, and frame-level self-distillation. The resulting model is called
TwoStream-SLR, which is competent for sign language recognition (SLR).
TwoStream-SLR is extended to a sign language translation (SLT) model,
TwoStream-SLT, by simply attaching an extra translation network.
Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art
performance on SLR and SLT tasks across a series of datasets including
Phoenix-2014, Phoenix-2014T, and CSL-Daily. Code and models are available at:
https://github.com/FangyunWei/SLRT.
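As a rough illustration of the architecture described in the abstract, below is a minimal, hypothetical PyTorch sketch of a dual visual encoder with a bidirectional lateral connection and per-stream auxiliary heads. All module names, layer choices (a toy per-frame 2D CNN stands in for the real video backbone), and dimensions are illustrative assumptions, not the authors' implementation; the official code at https://github.com/FangyunWei/SLRT is authoritative.

```python
import torch
import torch.nn as nn


class LateralConnection(nn.Module):
    """Bidirectional lateral connection: each stream receives a projected copy
    of the other stream's frame-level features (simple additive fusion here)."""

    def __init__(self, dim: int):
        super().__init__()
        self.video_to_kpt = nn.Linear(dim, dim)  # video -> keypoint direction
        self.kpt_to_video = nn.Linear(dim, dim)  # keypoint -> video direction

    def forward(self, video_feat, kpt_feat):
        # Both inputs: (B, T, dim)
        return (video_feat + self.kpt_to_video(kpt_feat),
                kpt_feat + self.video_to_kpt(video_feat))


class TwoStreamEncoder(nn.Module):
    # num_keypoints and num_glosses are placeholder sizes, not dataset-specific values.
    def __init__(self, num_keypoints: int = 75, dim: int = 256, num_glosses: int = 1000):
        super().__init__()
        # Video stream: a toy per-frame 2D CNN stands in for the real backbone.
        self.video_stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.video_proj = nn.Linear(32, dim)
        # Keypoint stream: per-frame MLP over flattened (x, y) coordinates.
        self.kpt_mlp = nn.Sequential(
            nn.Linear(num_keypoints * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.lateral = LateralConnection(dim)
        # Per-stream heads allow auxiliary supervision; the joint head fuses both streams.
        self.video_head = nn.Linear(dim, num_glosses)
        self.kpt_head = nn.Linear(dim, num_glosses)
        self.joint_head = nn.Linear(2 * dim, num_glosses)

    def forward(self, frames, keypoints):
        # frames: (B, T, 3, H, W); keypoints: (B, T, K, 2)
        B, T = frames.shape[:2]
        v = self.video_stem(frames.flatten(0, 1)).flatten(1)  # (B*T, 32)
        v = self.video_proj(v).view(B, T, -1)                 # (B, T, dim)
        k = self.kpt_mlp(keypoints.flatten(2))                # (B, T, dim)
        v, k = self.lateral(v, k)                             # streams exchange information
        joint = self.joint_head(torch.cat([v, k], dim=-1))    # (B, T, num_glosses)
        return self.video_head(v), self.kpt_head(k), joint


if __name__ == "__main__":
    model = TwoStreamEncoder()
    frames = torch.randn(2, 16, 3, 112, 112)   # 16 RGB frames per clip
    keypoints = torch.randn(2, 16, 75, 2)      # matching keypoint sequence
    v_logits, k_logits, joint_logits = model(frames, keypoints)
    print(joint_logits.shape)                  # torch.Size([2, 16, 1000])
```

In training, each head's frame-level gloss distributions could be supervised with CTC, and frame-level self-distillation could push each single-stream distribution toward the joint one (e.g., via a KL term); attaching an extra translation network on top of the encoder outputs would then turn the recognizer into a translation model, mirroring the TwoStream-SLR to TwoStream-SLT extension described in the abstract.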
Related papers
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of embedding dense visual data during end-to-end training.
We propose a novel sign language representation framework, the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background.
We propose a multi-stream keypoint attention network to model a sequence of keypoints produced by a readily available keypoint estimator.
We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z)
- Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z)
- CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning [38.83062453145388]
Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval.
We take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual mappings.
Our framework outperforms the pioneering method by large margins on various datasets.
arXiv Detail & Related papers (2023-03-22T17:59:59Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embeddings of the class labels (see the sketch after this list).
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and enriches WSLR models by transferring knowledge from subtitled sign language news to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
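The zero-shot entry above hinges on mapping visual features into the semantic (word-embedding) space of the class labels. The sketch below illustrates that mapping in minimal form; the linear projection, the cosine-similarity scoring, and the 512/300 feature dimensions are illustrative assumptions, not that paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticSpaceClassifier(nn.Module):
    """Projects visual features into a label-embedding space and scores classes
    by cosine similarity, so classes unseen in training can still be recognized."""

    def __init__(self, visual_dim: int = 512, embed_dim: int = 300):
        super().__init__()
        self.project = nn.Linear(visual_dim, embed_dim)

    def forward(self, visual_feat, label_embeddings):
        # visual_feat: (B, visual_dim); label_embeddings: (num_classes, embed_dim)
        z = F.normalize(self.project(visual_feat), dim=-1)
        e = F.normalize(label_embeddings, dim=-1)
        return z @ e.t()  # cosine-similarity logits over (possibly unseen) classes


if __name__ == "__main__":
    clf = SemanticSpaceClassifier()
    feats = torch.randn(4, 512)    # e.g., pooled visual features for 4 clips
    unseen = torch.randn(20, 300)  # word embeddings of 20 unseen sign labels
    print(clf(feats, unseen).argmax(dim=-1))
```

Because classification reduces to finding the nearest label embedding, classes never seen during training can be scored as long as word embeddings for their labels are available.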
This list is automatically generated from the titles and abstracts of the papers listed on this site.