Transferring Cross-domain Knowledge for Video Sign Language Recognition
- URL: http://arxiv.org/abs/2003.03703v2
- Date: Tue, 17 Mar 2020 14:53:06 GMT
- Title: Transferring Cross-domain Knowledge for Video Sign Language Recognition
- Authors: Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li
- Abstract summary: Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
- Score: 103.9216648495958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word-level sign language recognition (WSLR) is a fundamental task in sign
language interpretation. It requires models to recognize isolated sign words
from videos. However, annotating WSLR data needs expert knowledge, thus
limiting WSLR dataset acquisition. On the contrary, there are abundant
subtitled sign news videos on the internet. Since these videos have no
word-level annotation and exhibit a large domain gap from isolated signs, they
cannot be directly used for training WSLR models. We observe that despite the
existence of a large domain gap, isolated and news signs share the same visual
concepts, such as hand gestures and body movements. Motivated by this
observation, we propose a novel method that learns domain-invariant visual
concepts and fertilizes WSLR models by transferring knowledge of subtitled news
sign to them. To this end, we extract news signs using a base WSLR model, and
then design a classifier jointly trained on news and isolated signs to coarsely
align these two domain features. In order to learn domain-invariant features
within each class and suppress domain-specific features, our method further
resorts to an external memory to store the class centroids of the aligned news
signs. We then design a temporal attention based on the learnt descriptor to
improve recognition performance. Experimental results on standard WSLR datasets
show that our method outperforms previous state-of-the-art methods
significantly. We also demonstrate the effectiveness of our method on
automatically localizing signs from sign news, achieving 28.1 for AP@0.5.
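The AP@0.5 figure quoted above is a localization metric: a predicted sign segment counts as correct only if its temporal intersection-over-union (tIoU) with a ground-truth segment is at least 0.5. The sketch below illustrates that idea for a single class, assuming segments are `(start, end)` pairs in seconds and using a plain, non-interpolated average precision. The function names, the segment representation, and the greedy matching rule are illustrative assumptions; the abstract does not specify the paper's exact evaluation protocol.

```python
from typing import List, Tuple

# Assumed representation: a segment is (start, end) in seconds.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[Segment, float]],
    truths: List[Segment],
    thr: float = 0.5,
) -> float:
    """Non-interpolated AP for one class at a tIoU threshold.

    Predictions are ranked by confidence. Each one is matched to the
    unused ground-truth segment with the highest tIoU, provided that
    tIoU is at least `thr`; unmatched predictions are false positives.
    """
    ranked = sorted(preds, key=lambda p: p[1], reverse=True)
    used = [False] * len(truths)
    hits = 0
    precision_sum = 0.0
    for rank, (seg, _score) in enumerate(ranked, start=1):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(truths):
            iou = temporal_iou(seg, gt)
            if not used[j] and iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used[best_j] = True
            hits += 1
            # Precision at this rank, accumulated at each true positive.
            precision_sum += hits / rank
    return precision_sum / len(truths) if truths else 0.0


if __name__ == "__main__":
    # Hypothetical predictions: (segment, confidence score).
    predictions = [((0.0, 2.0), 0.9), ((5.0, 6.0), 0.8), ((10.0, 12.0), 0.7)]
    ground_truth = [(0.0, 2.0), (10.0, 12.0)]
    print(average_precision(predictions, ground_truth))
```

Benchmarks typically average this per-class value over all classes and may use an interpolated precision curve instead, so numbers from this sketch are not directly comparable to the reported 28.1.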
Related papers
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
arXiv Detail & Related papers (2023-08-21T15:58:47Z)
- Learning from What is Already Out There: Few-shot Sign Language Recognition with Online Dictionaries [0.0]
We open-source the UWB-SL-Wild few-shot dataset, the first of its kind training resource consisting of dictionary-scraped videos.
We introduce a novel approach to training sign language recognition models in a few-shot scenario, resulting in state-of-the-art results.
arXiv Detail & Related papers (2023-01-10T03:21:01Z)
- Two-Stream Network for Sign Language Recognition and Translation [38.43767031555092]
We introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences.
The resulting model is called TwoStream-SLR, which is competent for sign language recognition.
TwoStream-SLR is extended to a sign language translation model, TwoStream-SLT, by simply attaching an extra translation network.
arXiv Detail & Related papers (2022-11-02T17:59:58Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523]
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
arXiv Detail & Related papers (2022-08-04T17:55:09Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues [106.21067543021887]
We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
arXiv Detail & Related papers (2020-07-23T16:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.