Self-Sufficient Framework for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2303.11771v1
- Date: Tue, 21 Mar 2023 11:42:57 GMT
- Title: Self-Sufficient Framework for Continuous Sign Language Recognition
- Authors: Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin
Kim, In So Kweon, Joon Son Chung
- Abstract summary: The goal of this work is to develop self-sufficient framework for Continuous Sign Language Recognition.
These include the need for complex multi-scale features such as hands, face, and mouth for understanding, and absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without the need for additional networks or annotations.
DPLR propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence labels with the predicted sequence.
- Score: 75.60327502570242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to develop self-sufficient framework for Continuous
Sign Language Recognition (CSLR) that addresses key issues of sign language
recognition. These include the need for complex multi-scale features such as
hands, face, and mouth for understanding, and absence of frame-level
annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv)
which extracts both manual and non-manual features without the need for
additional networks or annotations, and (2) Dense Pseudo-Label Refinement
(DPLR) which propagates non-spiky frame-level pseudo-labels by combining the
ground truth gloss sequence labels with the predicted sequence. We demonstrate
that our model achieves state-of-the-art performance among RGB-based methods on
large-scale CSLR benchmarks, PHOENIX-2014 and PHOENIX-2014-T, while showing
comparable results with better efficiency when compared to other approaches
that use multi-modality or extra annotations.
Related papers
- Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning [0.0]
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction.
Current self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse.
We propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information.
arXiv Detail & Related papers (2025-02-02T07:42:45Z) - DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [20.953645420787527]
We train a CLIP-like model with only a fraction of the computational cost compared to CLIP.
We achieve state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-12-20T20:46:48Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (Fine CLIPER)
Our Fine CLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - Continuous Sign Language Recognition Using Intra-inter Gloss Attention [0.0]
In this study, we introduce a novel module in sign language recognition studies, called intra-inter gloss attention module.
In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk.
Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner.
arXiv Detail & Related papers (2024-06-26T13:21:08Z) - A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Improving Continuous Sign Language Recognition with Consistency
Constraints and Signer Removal [24.537234147678113]
We propose three auxiliary tasks to enhance the CSLR backbones.
A keypoint-guided spatial attention module is developed to enforce the visual module.
A sentence embedding consistency constraint is imposed between the visual and sequential modules.
Our model achieves state-of-the-art or competitive performance on five benchmarks.
arXiv Detail & Related papers (2022-12-26T06:38:34Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition ( SLR)
Our proposed SAM- SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework calledInter-class DiscrepancyAlignment(IDA)
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.