Continuous Sign Language Recognition Based on Motor attention mechanism
and frame-level Self-distillation
- URL: http://arxiv.org/abs/2402.19118v1
- Date: Thu, 29 Feb 2024 12:52:50 GMT
- Title: Continuous Sign Language Recognition Based on Motor attention mechanism
and frame-level Self-distillation
- Authors: Qidan Zhu, Jing Li, Fei Yuan, Quan Gan
- Abstract summary: We propose a novel motor attention mechanism to capture the distorted changes in local motion regions during sign language expression.
For the first time, we apply the self-distillation method to frame-level feature extraction for continuous sign language.
- Score: 17.518587972114567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Changes in facial expression, head movement, body movement and gesture
movement are remarkable cues in sign language recognition, yet most current
continuous sign language recognition (CSLR) methods focus on static images in
video sequences at the frame-level feature extraction stage and ignore the
dynamic changes between images. In this paper, we propose a novel motor
attention mechanism to capture the distorted changes in local motion regions
during sign language expression and obtain a dynamic representation of image
changes. For the first time, we apply self-distillation to frame-level feature
extraction for continuous sign language: features of adjacent stages are
self-distilled, with the higher-order features acting as teachers that guide
the lower-order features, which improves feature expression without additional
computational resources. The combination of the two constitutes our proposed
holistic CSLR model based on the motor attention mechanism and frame-level
self-distillation (MAM-FSD), which improves the inference ability and
robustness of the model. We conduct experiments on three publicly available
datasets, and the results show that the proposed method effectively extracts
sign language motion information from videos, improves the accuracy of CSLR
and reaches the state-of-the-art level.
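The two ingredients described in the abstract can be sketched briefly. The snippet below is a minimal, hedged illustration in PyTorch, not the authors' released code: a motion-based attention module derives a spatial saliency map from adjacent-frame feature differences, and a frame-level self-distillation loss detaches a deeper stage's features and uses them as the teacher for a shallower stage. Module names, tensor shapes and the choice of an MSE objective are assumptions made for illustration only.

```python
# Hedged sketch of the two ideas named in the abstract (assumed shapes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotorAttention(nn.Module):
    """Reweights frame-level features with a map built from adjacent-frame differences."""

    def __init__(self, channels: int):
        super().__init__()
        # Collapse the channel dimension into a single spatial saliency map.
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) frame-level feature maps.
        diff = feats[:, 1:] - feats[:, :-1]              # temporal differences, (B, T-1, C, H, W)
        diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 1, 0))     # zero-pad the time axis back to length T
        b, t, c, h, w = diff.shape
        attn = torch.sigmoid(self.proj(diff.reshape(b * t, c, h, w)))  # (B*T, 1, H, W) motion saliency
        attn = attn.reshape(b, t, 1, h, w)
        return feats * attn + feats                      # residual reweighting of the original features


def self_distill_loss(student_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """Frame-level self-distillation: a deeper ("higher-order") stage guides a shallower one.

    Assumes both tensors have already been projected to the same shape.
    """
    teacher = F.normalize(teacher_feats.detach(), dim=-1)  # no gradient flows into the teacher branch
    student = F.normalize(student_feats, dim=-1)
    return F.mse_loss(student, teacher)
```

In training, such a loss would simply be added to the main CSLR objective so the shallower stage receives guidance from the deeper one; since the teacher branch is only detached, not duplicated, this is consistent with the abstract's claim of no added computational resources at inference time.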
Related papers
- Audio-driven Gesture Generation via Deviation Feature in the Latent Space [2.8952735126314733]
We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation.
Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation.
Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
arXiv Detail & Related papers (2025-03-27T15:37:16Z)
- Beyond RNNs: Benchmarking Attention-Based Image Captioning Models [0.0]
This study benchmarks the performance of attention-based image captioning models against RNN-based approaches.
We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions.
Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions.
arXiv Detail & Related papers (2025-02-26T01:05:18Z)
- Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation [8.84657964527764]
We explore the representation of gestures in co-speech with a focus on self-supervised representation and pixel-level motion deviation.
Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation.
Results of our first experiment demonstrate that our method enhances the quality of generated videos.
arXiv Detail & Related papers (2024-09-26T09:33:20Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and flexible to be embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)