Multi-scale temporal network for continuous sign language recognition
- URL: http://arxiv.org/abs/2204.03864v1
- Date: Fri, 8 Apr 2022 06:14:22 GMT
- Title: Multi-scale temporal network for continuous sign language recognition
- Authors: Qidan Zhu, Jing Li, Fei Yuan, Quan Gan
- Abstract summary: Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
- Score: 10.920363368754721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous Sign Language Recognition (CSLR) is a challenging research task
due to the lack of accurate annotation on the temporal sequence of sign
language data. A recently popular approach to CSLR is a hybrid model based on
"CNN + RNN". However, when extracting temporal features, most of these methods
use a fixed temporal receptive field and therefore cannot extract the temporal
features of each sign language word well. To obtain more
accurate temporal features, this paper proposes a multi-scale temporal network
(MSTNet). The network mainly consists of three parts. A ResNet and two fully
connected (FC) layers constitute the frame-wise feature extraction part. The
time-wise feature extraction part performs temporal feature learning by first
extracting temporal receptive field features of different scales using the
proposed multi-scale temporal block (MST-block) to improve the temporal
modeling capability, and then further encoding the temporal features of
different scales by the transformers module to obtain more accurate temporal
features. Finally, the proposed multi-level Connectionist Temporal
Classification (CTC) loss part is used for training to obtain recognition
results. The multi-level CTC loss enables better learning and updating of the
shallow network parameters in CNN, and the method has no parameter increase and
can be flexibly embedded in other models. Experimental results on two publicly
available datasets demonstrate that our method can effectively extract sign
language features in an end-to-end manner without any prior knowledge,
improving the accuracy of CSLR and reaching the state-of-the-art.
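As a rough illustration (not the authors' implementation), the multi-scale temporal idea described above can be sketched as parallel temporal filters with different receptive fields whose outputs are concatenated along the channel axis. The kernel sizes, the simple averaging filter, and the channel layout below are assumptions for illustration only; the paper's MST-block uses learned convolutions followed by a transformer encoder.

```python
import numpy as np

def temporal_conv(x, kernel_size):
    """Toy 1D temporal filter with 'same' padding over frames.
    x: (T, C) frame-wise features; here a simple moving average
    stands in for a learned temporal convolution."""
    T, C = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = xp[t:t + kernel_size].mean(axis=0)
    return out

def multi_scale_temporal_block(x, scales=(1, 3, 5, 7)):
    """Hypothetical MST-block sketch: apply temporal filters with
    several receptive-field scales in parallel, then concatenate
    the per-scale features along the channel dimension."""
    return np.concatenate([temporal_conv(x, k) for k in scales], axis=1)

frames = np.random.rand(20, 64)           # 20 video frames, 64-dim features
fused = multi_scale_temporal_block(frames)
print(fused.shape)                        # (20, 256): 4 scales x 64 channels
```

In the full model, the concatenated multi-scale features would then be encoded by a transformer module and trained with CTC losses attached at multiple levels, which, as the abstract notes, adds no parameters.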
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate
Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - Temporal superimposed crossover module for effective continuous sign
language [10.920363368754721]
This paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM + 2D convolution" hybrid convolution.
Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method and achieve highly competitive results.
arXiv Detail & Related papers (2022-11-07T09:33:42Z) - Continuous Sign Language Recognition via Temporal Super-Resolution
Network [10.920363368754721]
This paper addresses the high computational cost of deep-learning-based spatial-temporal hierarchical continuous sign language recognition models.
The data is reconstructed into a dense feature sequence to reduce the overall model size while keeping the loss in final recognition accuracy to a minimum.
Experiments on two large-scale sign language datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-07-03T00:55:45Z) - Large Scale Time-Series Representation Learning via Simultaneous Low and
High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for two branches of the model.
To demonstrate the robustness of our model we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z) - Multi-View Spatial-Temporal Network for Continuous Sign Language
Recognition [0.76146285961466]
This paper proposes a multi-view spatial-temporal continuous sign language recognition network.
It is tested on two public sign language datasets, SLR-100 and PHOENIX-Weather 2014T (RWTH).
arXiv Detail & Related papers (2022-04-19T08:43:03Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z) - Temporal Interlacing Network [8.876132549551738]
The temporal interlacing network (TIN) is a simple yet powerful operator for learning temporal features.
TIN fuses the two kinds of information by interlacing spatial representations from the past to the future.
TIN won 1st place in the ICCV19 Multi Moments in Time challenge.
arXiv Detail & Related papers (2020-01-17T19:06:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.