Continuous Sign Language Recognition via Temporal Super-Resolution Network
- URL: http://arxiv.org/abs/2207.00928v1
- Date: Sun, 3 Jul 2022 00:55:45 GMT
- Title: Continuous Sign Language Recognition via Temporal Super-Resolution Network
- Authors: Qidan Zhu, Jing Li, Fei Yuan, Quan Gan
- Abstract summary: This paper addresses the high computational cost of deep-learning-based spatial-temporal hierarchical continuous sign language recognition models.
Sparse frame-level features are reconstructed into a dense feature sequence to reduce the overall model computation while keeping the loss in final recognition accuracy to a minimum.
Experiments on two large-scale sign language datasets demonstrate the effectiveness of the proposed model.
- Score: 10.920363368754721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-learning-based spatial-temporal hierarchical continuous sign
language recognition models are computationally expensive, which limits their
real-time application. To address this problem, this paper proposes a temporal
super-resolution network (TSRNet). Sparse frame-level features are
reconstructed into a dense feature sequence to reduce the overall model
computation while keeping the loss in final recognition accuracy to a minimum.
The continuous sign language recognition (CSLR) model built around TSRNet
consists of three parts: frame-level feature extraction, time-series feature
extraction, and TSRNet, where TSRNet sits between frame-level feature
extraction and time-series feature extraction and comprises two branches, a
detail descriptor and a rough descriptor. The features obtained from the two
designed branches are fused with the sparse frame-level features to form the
reconstructed dense frame-level feature sequence, and the connectionist
temporal classification (CTC) loss is used for training and optimization after
the time-series feature extraction part. To better recover semantic-level
information, the overall model is trained with the self-generating adversarial
training method proposed in this paper to reduce the model error rate: TSRNet
is treated as the generator, and the frame-level processing part together with
the temporal processing part acts as the discriminator. In addition, to unify
the evaluation of model accuracy loss across different benchmarks, this paper
proposes the word error rate deviation (WERD), defined as the error rate
between the estimated word error rate (WER) obtained with the reconstructed
frame-level feature sequence and the reference WER obtained with the complete
original frame-level feature sequence. Experiments on two large-scale sign
language datasets demonstrate the effectiveness of the proposed model.
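The abstract describes the pipeline but gives no implementation details. Below is a minimal PyTorch-style sketch of the described three-part layout (frame-level feature extraction, TSRNet between the two stages, time-series feature extraction trained with a CTC loss). All module names, layer choices, and dimensions (TSRNetSketch, the transposed-convolution and interpolation branches, the linear frame encoder, the BiLSTM, the vocabulary size) are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class TSRNetSketch(nn.Module):
        """Hypothetical temporal super-resolution block: reconstructs a sparse
        (temporally downsampled) frame-level feature sequence into a dense one
        by fusing a learned "detail" branch with a cheap "rough" branch.
        Internals are illustrative assumptions only."""
        def __init__(self, dim, scale=2):
            super().__init__()
            self.scale = scale
            # detail descriptor: learned temporal upsampling (assumed)
            self.detail = nn.ConvTranspose1d(dim, dim, kernel_size=scale, stride=scale)
            # rough descriptor: nearest-neighbour interpolation plus smoothing (assumed)
            self.rough = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

        def forward(self, x):                 # x: (batch, T_sparse, dim)
            x = x.transpose(1, 2)             # -> (batch, dim, T_sparse)
            detail = self.detail(x)           # (batch, dim, T_sparse * scale)
            rough = self.rough(nn.functional.interpolate(x, scale_factor=self.scale))
            dense = detail + rough            # fuse the two branches
            return dense.transpose(1, 2)      # -> (batch, T_dense, dim)

    # Overall CSLR layout from the abstract: frame-level features -> TSRNet ->
    # time-series model -> CTC loss. Every concrete choice below is a placeholder.
    frame_encoder  = nn.Linear(2048, 512)     # stand-in for a frame-level CNN
    tsrnet         = TSRNetSketch(dim=512, scale=2)
    temporal_model = nn.LSTM(512, 256, bidirectional=True, batch_first=True)
    classifier     = nn.Linear(512, 1296)     # vocabulary size is dataset-dependent
    ctc_loss       = nn.CTCLoss(blank=0, zero_infinity=True)

    sparse_feats = frame_encoder(torch.randn(2, 30, 2048))  # sparse frame-level features
    dense_feats  = tsrnet(sparse_feats)                      # reconstructed dense sequence
    logits       = classifier(temporal_model(dense_feats)[0])
    log_probs    = logits.log_softmax(-1).transpose(0, 1)    # CTC expects (T, batch, vocab)
    targets      = torch.randint(1, 1296, (2, 12))           # dummy gloss labels
    loss = ctc_loss(log_probs, targets,
                    input_lengths=torch.full((2,), log_probs.size(0), dtype=torch.long),
                    target_lengths=torch.full((2,), 12, dtype=torch.long))

One plausible formalization of the proposed WERD metric, reading "the error rate between the estimated WER and the reference WER" as a relative deviation (an assumption; the paper may normalize differently), is WERD = |WER_reconstructed - WER_reference| / WER_reference, where WER_reconstructed is measured with the reconstructed dense frame-level feature sequence and WER_reference with the complete original frame-level feature sequence.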
Related papers
- Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Gated Recurrent Neural Networks with Weighted Time-Delay Feedback [59.125047512495456]
We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism.
We show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures.
arXiv Detail & Related papers (2022-12-01T02:26:34Z) - Temporal superimposed crossover module for effective continuous sign
language [10.920363368754721]
This paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM), and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution.
Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method and achieve highly competitive results.
arXiv Detail & Related papers (2022-11-07T09:33:42Z) - Learning Signal Temporal Logic through Neural Network for Interpretable
Classification [13.829082181692872]
We propose an explainable neural-symbolic framework for the classification of time-series behaviors.
We demonstrate the computational efficiency, compactness, and interpretability of the proposed method through driving scenarios and naval surveillance case studies.
arXiv Detail & Related papers (2022-10-04T21:11:54Z) - STIP: A SpatioTemporal Information-Preserving and Perception-Augmented
Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during the feature extraction and the state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z) - Self-Supervised Video Object Segmentation via Cutout Prediction and
Tagging [117.73967303377381]
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and Youtube-VOS.
arXiv Detail & Related papers (2022-04-22T17:53:27Z) - Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z) - Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z) - Evaluation and Comparison of Deep Learning Methods for Pavement Crack
Identification with Visual Images [0.0]
Pavement crack identification from visual images via deep learning algorithms has the advantage of not being limited by the material of the object to be detected.
For patch-sample classification, the fine-tuned TL models can be equivalent to, or even slightly better than, the ED models in accuracy.
For accurate crack localization, both ED and GAN algorithms can achieve pixel-level segmentation and are expected to run in real time on low-computing-power platforms.
arXiv Detail & Related papers (2021-12-20T08:23:43Z) - Robust lEarned Shrinkage-Thresholding (REST): Robust unrolling for
sparse recover [87.28082715343896]
We consider deep neural networks for solving inverse problems that are robust to forward model mis-specifications.
We design a new robust deep neural network architecture by applying algorithm unfolding techniques to a robust version of the underlying recovery problem.
The proposed REST network is shown to outperform state-of-the-art model-based and data-driven algorithms in both compressive sensing and radar imaging problems.
arXiv Detail & Related papers (2021-10-20T06:15:45Z) - Group-based Bi-Directional Recurrent Wavelet Neural Networks for Video
Super-Resolution [4.9136996406481135]
Video super-resolution (VSR) aims to estimate a high-resolution (HR) frame from low-resolution (LR) frames.
The key challenge for VSR lies in the effective exploitation of intra-frame spatial correlation and temporal dependency between consecutive frames.
arXiv Detail & Related papers (2021-06-14T06:36:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.