Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition
- URL: http://arxiv.org/abs/2002.03187v1
- Date: Sat, 8 Feb 2020 15:38:44 GMT
- Title: Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition
- Authors: Hao Zhou, Wengang Zhou, Yun Zhou, Houqiang Li
- Abstract summary: We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
- Score: 141.24314054768922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of deep learning in continuous sign language
recognition (CSLR), deep models typically focus on the most discriminative
features, ignoring other potentially non-trivial and informative contents. Such
characteristic heavily constrains their capability to learn implicit visual
grammars behind the collaboration of different visual cues (i.e., hand shape,
facial expression and body posture). By injecting multi-cue learning into
neural network design, we propose a spatial-temporal multi-cue (STMC) network
to solve the vision-based sequence learning problem. Our STMC network consists
of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module. The
SMC module is dedicated to spatial representation and explicitly decomposes
visual features of different cues with the aid of a self-contained pose
estimation branch. The TMC module models temporal correlations along two
parallel paths, i.e., intra-cue and inter-cue, which aims to preserve the
uniqueness and explore the collaboration of multiple cues. Finally, we design a
joint optimization strategy to achieve the end-to-end sequence learning of the
STMC network. To validate the effectiveness, we perform experiments on three
large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental
results demonstrate that the proposed method achieves new state-of-the-art
performance on all three benchmarks.
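The two-path TMC design described in the abstract can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the function names (`temporal_conv`, `tmc_block`), the cue set, and the shared smoothing kernel are illustrative assumptions. The point is the structure: the intra-cue path filters each cue's feature sequence separately (preserving its uniqueness), while the inter-cue path filters the concatenation of all cues (modeling their collaboration).

```python
import numpy as np

def temporal_conv(x, kernel):
    """Depthwise temporal convolution with 'same' padding.
    x: (T, C) per-frame features; kernel: (K,) 1-D temporal filter."""
    T, C = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        # weighted sum of a K-frame temporal window, shared across channels
        out[t] = np.tensordot(kernel, xp[t:t + k], axes=(0, 0))
    return out

def tmc_block(cues, kernel):
    """Toy two-path temporal multi-cue block.
    Intra-cue path: filter each cue's sequence separately.
    Inter-cue path: filter the concatenation of all cues."""
    intra = {name: temporal_conv(feats, kernel) for name, feats in cues.items()}
    inter = temporal_conv(np.concatenate(list(cues.values()), axis=1), kernel)
    return intra, inter

rng = np.random.default_rng(0)
# four illustrative cue streams: 16 frames x 8 channels each
cues = {c: rng.standard_normal((16, 8)) for c in ("full-frame", "hand", "face", "pose")}
kernel = np.array([0.25, 0.5, 0.25])  # simple temporal smoothing filter
intra, inter = tmc_block(cues, kernel)
print(inter.shape)  # (16, 32)
```

In the actual STMC network these paths are learned temporal convolutions over features produced by the SMC module's pose-guided decomposition, and both paths feed the joint sequence-learning objective; the sketch only shows why the two paths produce complementary outputs (per-cue vs. fused).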
Related papers
- SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation [53.010417880335424]
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos.
Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data.
We propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations.
arXiv Detail & Related papers (2023-12-19T17:26:44Z)
- SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition [39.99711066167837]
This paper introduces a contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the sequences with a feature extractor to derive explicit clues from spatial and temporal domains respectively.
We conduct evaluations on the NTU-RGB+D (60 & 120) and PKU-MMD (I & II) datasets, covering various downstream tasks such as action recognition, action retrieval, and transfer learning.
arXiv Detail & Related papers (2023-09-11T21:32:13Z)
- Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition [14.450381668547259]
A vision transformer with the spatial self-attention mechanism cannot learn accurate attention maps for distinguishing different categories of fine-grained images.
We propose a spatial-temporal attention network for learning fine-grained feature representations, called STAN.
The proposed STAN-OSFGR outperforms 9 state-of-the-art open-set recognition methods significantly in most cases.
arXiv Detail & Related papers (2022-11-25T07:46:42Z)
- When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation [13.911947592067678]
In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented.
Our framework consists of a feature-learning module which is enhanced by ViT and CNN mutually, and a guidance module which is robust for consistency-aware purposes.
Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set.
arXiv Detail & Related papers (2022-08-12T18:21:22Z)
- Deep Image Clustering with Contrastive Learning and Multi-scale Graph Convolutional Networks [58.868899595936476]
This paper presents a new deep clustering approach termed image clustering with contrastive learning and multi-scale graph convolutional networks (IcicleGCN).
Experiments on multiple image datasets demonstrate the superior clustering performance of IcicleGCN over the state-of-the-art.
arXiv Detail & Related papers (2022-07-14T19:16:56Z)
- Multi-Perspective LSTM for Joint Visual Representation Learning [81.21490913108835]
We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
arXiv Detail & Related papers (2021-05-06T16:44:40Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Unsupervised Person Re-Identification with Multi-Label Learning Guided Self-Paced Clustering [48.31017226618255]
Unsupervised person re-identification (Re-ID) has drawn increasing research attention recently.
In this paper, we address unsupervised person Re-ID with a conceptually novel yet simple framework, termed Multi-label Learning guided self-paced Clustering (MLC).
MLC mainly learns discriminative features with three crucial modules, namely a multi-scale network, a multi-label learning module, and a self-paced clustering module.
arXiv Detail & Related papers (2021-03-08T07:30:13Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
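The 3D-CDC family mentioned in the entry above builds on central difference convolution, which blends a vanilla convolution with a central-difference response. Below is a toy single-channel, loop-based NumPy sketch of that decomposition (the name `cdc3d` and the valid-padding, single-channel simplification are illustrative assumptions; the paper's operator is an optimized multi-channel layer):

```python
import numpy as np

def cdc3d(x, w, theta=0.7):
    """Toy single-channel 3D central difference convolution, valid padding.
    Blends vanilla convolution with a central-difference term:
        y = vanilla - theta * x_center * sum(w)
    which equals (1 - theta) * vanilla + theta * conv(x - x_center)."""
    k = w.shape[0]          # assume a cubic k x k x k kernel
    c = k // 2              # index of the kernel's central position
    wsum = w.sum()
    oT, oH, oW = (s - k + 1 for s in x.shape)
    y = np.zeros((oT, oH, oW))
    for t in range(oT):
        for i in range(oH):
            for j in range(oW):
                patch = x[t:t + k, i:i + k, j:j + k]
                vanilla = (patch * w).sum()
                y[t, i, j] = vanilla - theta * patch[c, c, c] * wsum
    return y

# Sanity check: with theta = 0 the operator reduces to vanilla convolution.
x = np.ones((5, 5, 5))
w = np.ones((3, 3, 3))
print(cdc3d(x, w, theta=0.0)[0, 0, 0])  # 27.0 (sum of a 3x3x3 all-ones kernel)
```

The central-difference term makes the response depend on local gradients rather than raw intensities, which is why it helps capture fine temporal dynamics; `theta` trades off the two behaviors.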
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.