Improving Continuous Sign Language Recognition with Consistency
Constraints and Signer Removal
- URL: http://arxiv.org/abs/2212.13023v2
- Date: Thu, 11 Jan 2024 14:54:29 GMT
- Title: Improving Continuous Sign Language Recognition with Consistency
Constraints and Signer Removal
- Authors: Ronglai Zuo and Brian Mak
- Abstract summary: We propose three auxiliary tasks to enhance the CSLR backbones.
A keypoint-guided spatial attention module is developed to make the visual module focus on informative regions.
A sentence embedding consistency constraint is imposed between the visual and sequential modules.
Our model achieves state-of-the-art or competitive performance on five benchmarks.
- Score: 24.537234147678113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most deep-learning-based continuous sign language recognition (CSLR) models
share a similar backbone consisting of a visual module, a sequential module,
and an alignment module. However, due to limited training samples, a
connectionist temporal classification loss may not train such CSLR backbones
sufficiently. In this work, we propose three auxiliary tasks to enhance the
CSLR backbones. The first task enhances the visual module, which is sensitive
to the insufficient training problem, from the perspective of consistency.
Specifically, since the information of sign languages is mainly included in
signers' facial expressions and hand movements, a keypoint-guided spatial
attention module is developed to force the visual module to focus on
informative regions, i.e., spatial attention consistency. Second, noticing that
both the output features of the visual and sequential modules represent the
same sentence, to better exploit the backbone's power, a sentence embedding
consistency constraint is imposed between the visual and sequential modules to
enhance the representation power of both features. We name the CSLR model
trained with the above auxiliary tasks consistency-enhanced CSLR, which
performs well on signer-dependent datasets in which all signers appear during
both training and testing. To make it more robust for the signer-independent
setting, a signer removal module based on feature disentanglement is further
proposed to remove signer information from the backbone. Extensive ablation
studies are conducted to validate the effectiveness of these auxiliary tasks.
More remarkably, with a transformer-based backbone, our model achieves
state-of-the-art or competitive performance on five benchmarks, PHOENIX-2014,
PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are
available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
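The two consistency constraints and the signer removal idea can be summarized in a compact sketch. The following PyTorch-style code is an illustration under stated assumptions, not the authors' released implementation: tensor shapes, the KL/cosine loss choices, and the gradient-reversal mechanism for signer removal are all guesses; the actual code is in the repository linked above.

```python
# Minimal sketch of the auxiliary losses described in the abstract.
# Shapes, pooling, and loss choices are assumptions for illustration only.
import torch
import torch.nn.functional as F


def spatial_attention_consistency(attn_map, keypoint_heatmap, eps=1e-8):
    """Align the visual module's spatial attention map (batch, H, W) with a
    keypoint-derived heatmap over face/hand regions via a KL divergence."""
    p = attn_map.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    q = keypoint_heatmap.flatten(1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((p + eps).log(), q, reduction="batchmean")


def sentence_embedding_consistency(visual_feats, sequential_feats):
    """Pull together sentence-level embeddings pooled from the visual and
    sequential modules (batch, time, dim), since both encode the same sentence."""
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)
    s = F.normalize(sequential_feats.mean(dim=1), dim=-1)
    return (1.0 - (v * s).sum(dim=-1)).mean()


class GradReverse(torch.autograd.Function):
    """Gradient reversal, one common way to realize feature disentanglement:
    a signer classifier is trained on reversed features so that the backbone
    is pushed to discard signer identity. Whether the paper's signer removal
    module uses exactly this mechanism is an assumption."""

    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


# Hypothetical overall objective: CTC on gloss predictions plus the
# auxiliary terms, with illustrative weights.
# loss = ctc_loss + w_sac * sac_loss + w_sec * sec_loss + w_srm * signer_loss
```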
Related papers
- CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [56.199779065855004]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations.
Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z)
- Continuous Sign Language Recognition Using Intra-inter Gloss Attention [0.0]
In this study, we introduce a novel module for sign language recognition, called the intra-inter gloss attention module.
In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk (see the sketch below).
Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner.
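For the intra-gloss attention summarized above, a minimal sketch of the chunked self-attention idea follows. The chunk size, feature dimension, and use of torch.nn.MultiheadAttention are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class IntraGlossAttention(nn.Module):
    """Split a frame-feature sequence into fixed-size chunks and apply
    self-attention independently inside each chunk (intra-gloss attention).
    Chunk size and head count are illustrative choices."""

    def __init__(self, dim: int = 512, chunk_size: int = 8, num_heads: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pad the time axis so it divides into chunks
        b, t, d = x.shape
        pad = (-t) % self.chunk_size
        if pad:
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        n_chunks = x.shape[1] // self.chunk_size
        chunks = x.reshape(b * n_chunks, self.chunk_size, d)
        out, _ = self.attn(chunks, chunks, chunks)
        # restore the original sequence length, dropping any padding
        return out.reshape(b, -1, d)[:, :t]
```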
arXiv Detail & Related papers (2024-06-26T13:21:08Z)
- Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining [0.6144680854063939]
The state-of-the-art Conformer model for speech recognition is adapted for continuous sign language recognition.
This marks the first instance of employing Conformer for a vision-based task.
Unsupervised pretraining is conducted on a curated sign language dataset.
arXiv Detail & Related papers (2024-05-20T13:40:52Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs into a joint embedding space shared between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Improving Continuous Sign Language Recognition with Adapted Image Models [9.366498095041814]
Large-scale vision-language models (e.g., CLIP) have shown impressive generalization performance over a series of downstream tasks.
To enable high efficiency when adapting these large vision-language models to continuous sign language recognition, we propose a novel strategy (AdaptSign).
AdaptSign is able to demonstrate superior performance across a series of CSLR benchmarks including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL compared to existing methods.
arXiv Detail & Related papers (2024-04-12T03:43:37Z)
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- Self-Sufficient Framework for Continuous Sign Language Recognition [75.60327502570242]
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition.
Challenges include the need for complex multi-scale features, such as hands, face, and mouth, for understanding, and the absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without the need for additional networks or annotations.
DPLR propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence labels with the predicted sequence.
arXiv Detail & Related papers (2023-03-21T11:42:57Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)