CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
- URL: http://arxiv.org/abs/2303.12793v1
- Date: Wed, 22 Mar 2023 17:59:59 GMT
- Title: CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
- Authors: Yiting Cheng, Fangyun Wei, Jianmin Bao, Dong Chen, Wenqiang Zhang
- Abstract summary: Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval.
We take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual mappings.
Our framework outperforms the pioneering method by large margins on various datasets.
- Score: 38.83062453145388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work focuses on sign language retrieval, a recently proposed task for
sign language understanding. Sign language retrieval consists of two sub-tasks:
text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval.
Unlike the videos in traditional video-text retrieval, sign language videos not
only contain visual signals but also carry abundant semantic meaning in
themselves, because sign languages are themselves natural languages. Considering
this characteristic, we formulate sign language retrieval as a cross-lingual
retrieval problem as well as a video-text retrieval task. Concretely, we take
into account the linguistic properties of both sign languages and natural
languages, and simultaneously identify the fine-grained cross-lingual (i.e.,
sign-to-word) mappings while contrasting the texts and the sign videos in a
joint embedding space. We term this process cross-lingual contrastive
learning. Another challenge arises from data scarcity: sign language datasets
are orders of magnitude smaller than those used for speech recognition. We
alleviate this issue by adapting a domain-agnostic sign encoder, pre-trained on
large-scale sign videos, to the target domain via pseudo-labeling. Our
framework, termed domain-aware sign language retrieval via Cross-lingual
Contrastive learning (CiCo for short), outperforms the pioneering method by
large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1
improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1
improvements on the PHOENIX-2014T dataset. Code and models are available at:
https://github.com/FangyunWei/SLRT.
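The joint-embedding contrast described in the abstract follows the familiar CLIP-style recipe: encode each sign video and each text into the same space and apply a symmetric InfoNCE loss over a batch of matched pairs. The sketch below is a minimal illustration of that recipe, not the authors' implementation; the function name, feature dimensions, and temperature are illustrative, and the fine-grained sign-to-word mapping of CiCo is omitted.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss between
# sign-video and text embeddings in a joint space. Illustrative only.
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(sign_emb: torch.Tensor,
                                   text_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE; row i of sign_emb and text_emb form a matched pair."""
    sign_emb = F.normalize(sign_emb, dim=-1)   # cosine similarity via dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sign_emb @ text_emb.t() / temperature   # [B, B], logits[i, j] = sim(video_i, text_j)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # each video retrieves its text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # each text retrieves its video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random features standing in for encoder outputs.
sign_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
print(cross_lingual_contrastive_loss(sign_features, text_features).item())
```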
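The abstract also mentions adapting a domain-agnostic sign encoder to the target domain via pseudo-labeling. One common shape for this idea is a confidence-thresholded self-training loop, sketched below over pre-extracted clip features; the encoder/classifier interfaces, the threshold, and the dimensions are assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of pseudo-label based domain adaptation: a pre-trained,
# domain-agnostic model labels unlabeled target-domain clips, and confident
# predictions supervise fine-tuning on that domain. Interfaces are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(encoder, classifier, clips, threshold=0.8):
    """Keep only confident predictions as pseudo labels."""
    probs = classifier(encoder(clips)).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return clips[keep], labels[keep]

def adaptation_step(encoder, classifier, optimizer, clips, pseudo_labels):
    """One fine-tuning step on the target domain using the pseudo labels."""
    loss = F.cross_entropy(classifier(encoder(clips)), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on pre-extracted 512-d clip features with 100 sign classes.
encoder, classifier = nn.Linear(512, 512), nn.Linear(512, 100)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)
clips = torch.randn(32, 512)
confident_clips, labels = make_pseudo_labels(encoder, classifier, clips)
if len(labels) > 0:
    print(adaptation_step(encoder, classifier, optimizer, confident_clips, labels))
```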
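For reference, the R@1 numbers quoted above are standard retrieval recall: rank every candidate by similarity to the query and count how often the ground-truth match comes first. A minimal sketch (variable names illustrative):

```python
# Minimal sketch of Recall@1 over a query-by-candidate similarity matrix,
# where pair (i, i) is the ground-truth match. Illustrative only.
import torch

def recall_at_1(sim: torch.Tensor) -> float:
    """sim[i, j] = similarity(query_i, candidate_j); returns R@1 in percent."""
    top1 = sim.argmax(dim=-1)
    gt = torch.arange(sim.size(0), device=sim.device)
    return (top1 == gt).float().mean().item() * 100.0

# T2V ranks videos for each text query; V2T uses the transposed matrix.
sim = torch.randn(100, 100)
print("T2V R@1:", recall_at_1(sim), "V2T R@1:", recall_at_1(sim.t()))
```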
Related papers
- SignCLIP: Connecting Text and Sign Language by Contrastive Learning [39.72545568965546]
SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.
We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of 500 thousand video clips in up to 44 sign languages.
We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights.
arXiv Detail & Related papers (2024-07-01T13:17:35Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale [22.49602248323602]
A persistent challenge in sign language video processing is how we learn representations of sign language.
Our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer.
Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training.
arXiv Detail & Related papers (2024-06-11T03:00:41Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
arXiv Detail & Related papers (2023-08-21T15:58:47Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Topic Detection in Continuous Sign Language Videos [23.43298383445439]
We introduce the novel task of sign language topic detection.
We base our experiments on How2Sign, a large-scale video dataset spanning multiple semantic domains.
arXiv Detail & Related papers (2022-09-01T19:17:35Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge from subtitled news sign language to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.