Temporal Lift Pooling for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2207.08734v1
- Date: Mon, 18 Jul 2022 16:28:00 GMT
- Title: Temporal Lift Pooling for Continuous Sign Language Recognition
- Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Wei Feng
- Abstract summary: We derive temporal lift pooling (TLP) from the Lifting Scheme in signal processing to intelligently downsample features of different temporal hierarchies.
Our TLP is a three-stage procedure, which performs signal decomposition, component weighting and information fusion to generate a refined downsized feature map.
Experiments on two large-scale datasets show TLP outperforms hand-crafted methods and specialized spatial variants by a large margin (1.5%) with similar computational overhead.
- Score: 6.428695655854854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pooling methods are a necessity in modern neural networks for increasing
receptive fields and lowering computational costs. However, commonly used
hand-crafted pooling approaches, e.g., max pooling and average pooling, may not
well preserve discriminative features. While many researchers have elaborately
designed various pooling variants in the spatial domain to address these limitations
with much progress, the temporal aspect is rarely visited, and directly
applying hand-crafted methods or these specialized spatial variants may not be
optimal. In this paper, we derive temporal lift pooling (TLP) from the Lifting
Scheme in signal processing to intelligently downsample features of different
temporal hierarchies. The Lifting Scheme factorizes input signals into various
sub-bands with different frequencies, which can be viewed as different temporal
movement patterns. Our TLP is a three-stage procedure, which performs signal
decomposition, component weighting and information fusion to generate a refined
downsized feature map. We select a typical temporal task with long sequences,
i.e. continuous sign language recognition (CSLR), as our testbed to verify the
effectiveness of TLP. Experiments on two large-scale datasets show TLP
outperforms hand-crafted methods and specialized spatial variants by a large
margin (1.5%) with similar computational overhead. As a robust feature
extractor, TLP exhibits great generalizability upon multiple backbones on
various datasets and achieves new state-of-the-art results on two large-scale
CSLR datasets. Visualizations further demonstrate the mechanism of TLP in
correcting gloss borders. Code is released.
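The three-stage procedure described in the abstract can be sketched as a classic lifting step followed by weighting and fusion. This is a minimal NumPy illustration only, not the released implementation: the paper learns the predictor, updater, and component weights (e.g. with small convolutions), whereas here they are fixed scalars, and the function name `temporal_lift_pool` is made up for this sketch.

```python
import numpy as np

def temporal_lift_pool(x, w_low=0.7, w_high=0.3):
    """Sketch of temporal lift pooling over axis 0 (time).

    x: array of shape (T, C) with even T.
    Stage 1 (decomposition): split frames into even/odd streams, then
    predict/update as in the lifting scheme, yielding a low-frequency
    approximation and a high-frequency detail component.
    Stage 2 (weighting): scale each component (learned in the paper;
    fixed scalars here for illustration).
    Stage 3 (fusion): sum the weighted components into one refined
    feature map of length T // 2.
    """
    x_even, x_odd = x[0::2], x[1::2]        # lazy wavelet split
    detail = x_odd - x_even                 # predict: odd frame from even
    approx = x_even + 0.5 * detail          # update: preserve local mean
    return w_low * approx + w_high * detail # weight + fuse -> downsampled

# Toy input: 8 time steps, 1 channel
x = np.arange(8, dtype=float).reshape(8, 1)
y = temporal_lift_pool(x)
print(y.shape)  # (4, 1): temporal length halved, like stride-2 pooling
```

Unlike max or average pooling, the detail component survives fusion here, which is how a lifting-based pool can retain the high-frequency motion cues that hand-crafted pooling discards.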
Related papers
- Multi-Source and Test-Time Domain Adaptation on Multivariate Signals using Spatio-Temporal Monge Alignment [59.75420353684495]
Machine learning applications on signals such as computer vision or biomedical data often face challenges due to the variability that exists across hardware devices or session recordings.
In this work, we propose Spatio-Temporal Monge Alignment (STMA) to mitigate these variabilities.
We show that STMA leads to significant and consistent performance gains between datasets acquired with very different settings.
arXiv Detail & Related papers (2024-07-19T13:33:38Z)
- Dynamic Spatial-Temporal Aggregation for Skeleton-Aware Sign Language Recognition [10.048809585477555]
Skeleton-aware sign language recognition has gained popularity due to its ability to remain unaffected by background information.
Current methods utilize spatial graph modules and temporal modules to capture spatial and temporal features, respectively.
We propose a new spatial architecture consisting of two concurrent branches, which build input-sensitive joint relationships.
We then propose a new temporal module to model multi-scale temporal information to capture complex human dynamics.
arXiv Detail & Related papers (2024-03-19T07:42:57Z)
- A Multi-Stage Adaptive Feature Fusion Neural Network for Multimodal Gait Recognition [15.080096318551346]
Most existing gait recognition algorithms are unimodal, and a few multimodal gait recognition algorithms perform multimodal fusion only once.
We propose a multi-stage feature fusion strategy (MSFFS), which performs multimodal fusions at different stages in the feature extraction process.
Also, we propose an adaptive feature fusion module (AFFM) that considers the semantic association between silhouettes and skeletons.
arXiv Detail & Related papers (2023-12-22T03:25:15Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three aspects of merit: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Hierarchical Spherical CNNs with Lifting-based Adaptive Wavelets for Pooling and Unpooling [101.72318949104627]
We propose a novel framework of hierarchical convolutional neural networks (HS-CNNs) with a lifting structure to learn adaptive spherical wavelets for pooling and unpooling.
LiftHS-CNN ensures a more efficient hierarchical feature learning for both image- and pixel-level tasks.
arXiv Detail & Related papers (2022-05-31T07:23:42Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction [138.04956118993934]
We propose a novel Transformer-based method, coarse-to-fine sparse Transformer (CST), which embeds HSI sparsity into deep learning for HSI reconstruction.
In particular, CST uses our proposed spectra-aware screening mechanism (SASM) for coarse patch selecting. Then the selected patches are fed into our customized spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and self-similarity capturing.
arXiv Detail & Related papers (2022-03-09T16:17:47Z)
- Sequential Place Learning: Heuristic-Free High-Performance Long-Term Place Recognition [24.70946979449572]
We develop a learning-based CNN+LSTM architecture, trainable via backpropagation through time, for viewpoint- and appearance-invariant place recognition.
Our model outperforms 15 classical methods while setting new state-of-the-art performance standards.
In addition, we show that SPL can be up to 70x faster to deploy than classical methods on a 729 km route.
arXiv Detail & Related papers (2021-03-02T22:57:43Z)
- Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.