StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition
- URL: http://arxiv.org/abs/2212.12857v2
- Date: Sun, 7 Apr 2024 06:34:37 GMT
- Title: StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition
- Authors: Xiaolong Shen, Zhedong Zheng, Yi Yang
- Abstract summary: We propose a new framework called Spatial-temporal Part-aware network (StepNet) based on RGB parts.
Part-level Spatial Modeling automatically captures the appearance-based properties, such as hands and faces, in the feature space.
Part-level Temporal Modeling implicitly mines the long-short term context to capture the relevant attributes over time.
- Score: 33.44126628779347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches fall into two lines, i.e., skeleton-based and RGB-based methods, and both have limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without using any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines the long-short term context to capture the relevant attributes over time. Extensive experiments demonstrate that StepNet, thanks to its spatial-temporal modules, achieves competitive Top-1 Per-instance accuracy on three commonly used SLR benchmarks: 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with optical flow input and can produce superior performance when the two are fused. We hope that our work can act as a preliminary step for those who are hard of hearing.
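The abstract reports Top-1 Per-instance accuracy. As a minimal sketch of what that metric usually means, the snippet below computes it next to the per-class variant that SLR benchmarks often report alongside it. The function names and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import defaultdict


def top1_per_instance(y_true, y_pred):
    """Fraction of samples whose top-1 prediction matches the label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def top1_per_class(y_true, y_pred):
    """Top-1 accuracy computed per class, then averaged over classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example (made-up labels): class 0 has three samples, class 1 has one.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(top1_per_instance(y_true, y_pred))  # 3 of 4 samples correct
print(top1_per_class(y_true, y_pred))     # mean of 2/3 and 1/1
```

Per-instance accuracy weights every video equally, so frequent signs dominate the score; the per-class average weights every sign equally, which is why the two numbers can differ on imbalanced datasets.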
Related papers
- USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition [3.8100688074986095]
Continuous sign language recognition requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos.
These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies.
We propose the Unified Spatio-Temporal Modeling (USTM) framework to address these limitations.
Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities.
arXiv Detail & Related papers (2025-12-15T15:05:16Z)
- Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments.
We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework.
The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
arXiv Detail & Related papers (2025-10-05T09:32:12Z)
- Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition [1.949837893170278]
Hand gesture-based Sign Language Recognition serves as a crucial bridge between deaf and non-deaf individuals.
We propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture.
We validated our model through extensive experiments on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-21T04:57:18Z)
- Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model [0.5825410941577593]
We propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images.
Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space.
arXiv Detail & Related papers (2024-08-26T08:55:16Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Dynamic Spatial-Temporal Aggregation for Skeleton-Aware Sign Language Recognition [10.048809585477555]
Skeleton-aware sign language recognition has gained popularity due to its ability to remain unaffected by background information.
Current methods utilize spatial graph modules and temporal modules to capture spatial and temporal features, respectively.
We propose a new spatial architecture consisting of two concurrent branches, which build input-sensitive joint relationships.
We then propose a new temporal module to model multi-scale temporal information to capture complex human dynamics.
arXiv Detail & Related papers (2024-03-19T07:42:57Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes.
We propose a part-aware prototypical representation for one-shot skeleton-based action recognition.
We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
- Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition [10.60209288486904]
It's common for current methods in skeleton-based action recognition to mainly consider capturing long-term temporal dependencies.
We propose a general framework, coined as STGAT, to model cross-spacetime information flow.
STGAT achieves state-of-the-art performance on three large-scale datasets.
arXiv Detail & Related papers (2022-08-18T02:34:46Z)
- Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) method for face parsing.
Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection.
Our method achieves new state-of-the-art performance on the Helen, CelebAMask-HQ, and Lapa datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
- Denoised Non-Local Neural Network for Semantic Segmentation [18.84185406522064]
We propose a Denoised Non-Local Network (Denoised NL) to eliminate the inter-class and intra-class noises respectively.
Our proposed Denoised NL achieves state-of-the-art performance of 83.5% and 46.69% mIoU on Cityscapes and ADE20K, respectively.
arXiv Detail & Related papers (2021-10-27T06:16:31Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- The Devil is in the Boundary: Exploiting Boundary Representation for Basis-based Instance Segmentation [85.153426159438]
We propose Basis based Instance (B2Inst) to learn a global boundary representation that can complement existing global-mask-based methods.
Our B2Inst leads to consistent improvements and accurately parses out the instance boundaries in a scene.
arXiv Detail & Related papers (2020-11-26T11:26:06Z)
- Video-based Sign Language Recognition without Temporal Segmentation [88.03159640595187]
We propose a novel continuous sign recognition framework, which eliminates the preprocessing of temporal segmentation.
The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) for video feature representation generation, a Latent Space for semantic gap bridging, and a Hierarchical Attention Network (HAN) for latent space based recognition.
arXiv Detail & Related papers (2018-01-30T17:37:42Z)