Temporal superimposed crossover module for effective continuous sign language
- URL: http://arxiv.org/abs/2211.03387v3
- Date: Sat, 1 Apr 2023 10:34:13 GMT
- Title: Temporal superimposed crossover module for effective continuous sign language
- Authors: Qidan Zhu, Jing Li, Fei Yuan, Quan Gan
- Abstract summary: This paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution.
Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method and achieve highly competitive results.
- Score: 10.920363368754721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ultimate goal of continuous sign language recognition (CSLR) is to
facilitate communication between hearing-impaired and hearing people, which
requires a certain degree of real-time performance and deployability from the model.
However, previous research on CSLR has paid little attention to real-time
performance and deployability. To improve both, this paper proposes a
zero-parameter, zero-computation temporal superposition crossover module (TSCM)
and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid
convolution, which gives 2D convolution strong spatial-temporal modelling
capability with no parameter increase and a lower deployment cost than other
spatial-temporal convolutions. The overall CSLR model based on TSCM is built on
the improved ResBlockT network proposed in this paper: the "TSCM+2D convolution"
hybrid convolution is applied to the ResBlock of the ResNet network to form the
new ResBlockT, and random gradient stopping and a multi-level CTC loss are
introduced to train the model. This reduces the final recognition word error
rate (WER) while also reducing training memory usage, and extends the ResNet
network from image classification to video recognition. In addition, this study
is the first in CSLR to use only 2D convolution to extract the spatial-temporal
features of sign language video for end-to-end recognition. Experiments
on two large-scale continuous sign language datasets demonstrate the
effectiveness of the proposed method and achieve highly competitive results.
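The core idea in the abstract, a zero-parameter, zero-computation module that lets plain 2D convolutions see temporal context, can be illustrated with a small sketch. This is not the paper's exact TSCM (its precise crossover scheme is not given here); the function name, the channel split, and the averaging used to superimpose neighbouring frames are all assumptions chosen to show the general pattern of parameter-free temporal mixing:

```python
import numpy as np

def temporal_superposition_crossover(x, shift_div=4):
    """Illustrative zero-parameter temporal crossover over clip features.

    x: float array of shape (T, C, H, W) -- T frames, C channels.
    A fraction (2/shift_div) of the channels is superimposed with the
    corresponding channels of neighbouring frames, so a following plain
    2D convolution sees temporal context at no extra parameter cost.
    """
    T, C, H, W = x.shape
    fold = C // shift_div
    out = x.copy()
    # mix the first fold of channels with the previous frame
    out[1:, :fold] = 0.5 * (x[1:, :fold] + x[:-1, :fold])
    # mix the second fold of channels with the next frame
    out[:-1, fold:2 * fold] = 0.5 * (x[:-1, fold:2 * fold] + x[1:, fold:2 * fold])
    # remaining channels pass through unchanged
    return out
```

Because the module only re-indexes and averages existing activations, it adds no learnable weights, which is what allows a "TSCM+2D convolution" style hybrid to keep the deployment cost of an ordinary 2D network.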
Related papers
- CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration [2.400446821380503]
Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud.
Most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization.
We propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose.
arXiv Detail & Related papers (2024-08-05T11:40:59Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential to effectively accelerate advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Continuous Sign Language Recognition via Temporal Super-Resolution Network [10.920363368754721]
This paper addresses the large computational cost of deep-learning-based spatial-temporal hierarchical continuous sign language recognition models.
The data is reconstructed into a dense feature sequence to reduce the overall model computation while keeping the loss in final recognition accuracy to a minimum.
Experiments on two large-scale sign language datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-07-03T00:55:45Z) - Large Scale Time-Series Representation Learning via Simultaneous Low and High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for two branches of the model.
To demonstrate the robustness of our model, we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z) - Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z) - Fully Convolutional Networks for Continuous Sign Language Recognition [83.85895472824221]
Continuous sign language recognition is a challenging task that requires learning on both spatial and temporal dimensions.
We propose a fully convolutional network (FCN) for online SLR to concurrently learn spatial and temporal features from weakly annotated video sequences.
arXiv Detail & Related papers (2020-07-24T08:16:37Z) - Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling [106.15327903038705]
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation.
We present a self-supervised learning method for VO with special consideration for consistency over longer sequences.
We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO.
arXiv Detail & Related papers (2020-07-21T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.