TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions
- URL: http://arxiv.org/abs/2403.11818v1
- Date: Mon, 18 Mar 2024 14:20:17 GMT
- Title: TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions
- Authors: Hui Lu, Albert Ali Salah, Ronald Poppe
- Abstract summary: A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input.
We propose TCNet, a hybrid network that effectively models spatio-temporal information from trajectories and correlated regions.
We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily.
- Score: 10.954210339694841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. In addition, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movements, of a specific region in motion. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce the computation cost and memory. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively.
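The idea of restricting self-attention to tokens along one aligned trajectory can be sketched as follows. This is an illustrative NumPy sketch, not TCNet's actual implementation: the shared (untrained) query/key/value projections and the function names are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trajectory_self_attention(trajectories):
    """Self-attention restricted to tokens on one aligned trajectory.

    trajectories: (num_traj, T, d) array; each row is one region (e.g.
    a hand) tracked across T frames. A query token only mixes with the
    history of its own moving region, never with unrelated regions.
    """
    n, T, d = trajectories.shape
    q = k = v = trajectories                         # untrained sketch: shared projections
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n, T, T) per-trajectory scores
    weights = softmax(scores, axis=-1)
    return weights @ v                               # (n, T, d)

tokens = np.random.default_rng(0).normal(size=(4, 8, 16))
out = trajectory_self_attention(tokens)
print(out.shape)  # (4, 8, 16)
```

Because attention is computed per trajectory rather than over all frame regions, the attention matrix is (T, T) per region instead of quadratic in the full spatial grid, which is one way the paper's design keeps computation down.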
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
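Representing a 1-D behavioral signal as a 2-D (scale x time) tensor via a continuous wavelet transform can be sketched as below. This is a minimal sketch using a Ricker (Mexican-hat) wavelet; TCCT-Net's actual wavelet choice and parameters are not specified here, so the kernel sizing is an assumption.

```python
import numpy as np

def ricker(points, width):
    """Ricker (Mexican-hat) wavelet of a given width."""
    t = np.arange(points) - (points - 1) / 2.0
    a = t / width
    return (1 - a**2) * np.exp(-0.5 * a**2) * (2 / (np.sqrt(3 * width) * np.pi**0.25))

def cwt_tensor(signal, widths):
    """Continuous wavelet transform of a 1-D signal into a 2-D
    (num_scales x time) tensor: one convolution per wavelet width."""
    out = np.empty((len(widths), len(signal)))
    for i, w in enumerate(widths):
        kernel = ricker(min(10 * int(w), len(signal)), w)  # heuristic kernel length
        out[i] = np.convolve(signal, kernel, mode="same")
    return out

t = np.linspace(0, 1, 256)
sig = np.sin(2 * np.pi * 8 * t)                 # toy 8 Hz behavioral signal
tensor = cwt_tensor(sig, widths=np.arange(1, 17))
print(tensor.shape)  # (16, 256)
```

The resulting 2-D tensor can then be fed to standard 2-D convolutional layers, which is the point of the "TC" stream's representation.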
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding [60.74434735079253]
We propose a method known as Neighborhood Attention with Convolutions TCN (NAC-TCN).
We accomplish this by introducing a causal version of Dilated Neighborhood Attention while incorporating it with convolutions.
Our model achieves comparable, better, or state-of-the-art performance over TCNs, TCAN, LSTMs, and GRUs while requiring fewer parameters on standard emotion recognition datasets.
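The causal, dilated neighborhood constraint on attention can be made concrete as a boolean mask. This is an illustrative sketch of the masking pattern only, not NAC-TCN's implementation; the function name and parameters are assumptions.

```python
import numpy as np

def causal_dilated_neighborhood_mask(seq_len, kernel_size, dilation):
    """Boolean attention mask: position i may attend to positions
    i, i - dilation, i - 2*dilation, ..., up to kernel_size tokens
    back, and never to future positions (causality)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        for k in range(kernel_size):
            j = i - k * dilation
            if j >= 0:
                mask[i, j] = True
    return mask

m = causal_dilated_neighborhood_mask(6, kernel_size=3, dilation=2)
print(m.astype(int))
```

Like a dilated temporal convolution, the receptive field grows with dilation while each query still touches only `kernel_size` keys, which is why such models can match TCNs and RNNs with fewer parameters.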
arXiv Detail & Related papers (2023-12-12T18:41:30Z) - Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data [50.84488941336865]
We propose a novel method called Fully-Connected Spatial-Temporal Graph Neural Network (FC-STGNN).
For graph construction, we design a decay graph to connect sensors across all timestamps based on their temporal distances.
For graph convolution, we devise FC graph convolution with a moving-pooling GNN layer to effectively capture the ST dependencies for learning effective representations.
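The decay-graph construction, connecting sensors across all timestamps with weights that fall off with temporal distance, can be sketched as follows. This is a minimal sketch; the exponential form and the `decay` parameter are assumptions for illustration, not FC-STGNN's exact formulation.

```python
import numpy as np

def decay_graph(num_sensors, num_timestamps, decay=0.5):
    """Weighted adjacency over (sensor, timestamp) nodes.

    Every node is connected to every other, with the edge weight
    decaying exponentially in the temporal distance between the two
    nodes' timestamps, so recent readings influence each other more."""
    times = np.repeat(np.arange(num_timestamps), num_sensors)  # node -> timestamp
    dist = np.abs(times[:, None] - times[None, :])
    return np.exp(-decay * dist)   # (num_sensors*num_timestamps, ...) adjacency

A = decay_graph(num_sensors=3, num_timestamps=4)
print(A.shape)  # (12, 12)
```

A graph convolution over this dense adjacency then mixes information across both sensors and timestamps in a single operation, which is the "fully-connected spatial-temporal" idea.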
arXiv Detail & Related papers (2023-09-11T08:44:07Z) - ESGCN: Edge Squeeze Attention Graph Convolutional Network for Traffic Flow Forecasting [15.475463516901938]
We propose the Edge Squeeze Graph Convolutional Network (ESGCN) to forecast traffic flow in multiple regions.
ESGCN achieves state-of-the-art performance by a large margin on four real-world datasets.
arXiv Detail & Related papers (2023-07-03T04:47:42Z) - Continuous Sign Language Recognition with Correlation Network [6.428695655854854]
We propose correlation network (CorrNet) to explicitly capture and leverage body trajectories across frames to identify signs.
CorrNet achieves new state-of-the-art accuracy on four large-scale datasets.
arXiv Detail & Related papers (2023-03-06T15:02:12Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Traffic Flow Forecasting with Spatial-Temporal Graph Diffusion Network [39.65520262751766]
We develop a new traffic prediction framework-Spatial-Temporal Graph Diffusion Network (ST-GDN)
In particular, ST-GDN is a hierarchically structured graph neural architecture which learns not only the local region-wise geographical dependencies, but also the spatial semantics from a global perspective.
Experiments on several real-life traffic datasets demonstrate that ST-GDN outperforms different types of state-of-the-art baselines.
arXiv Detail & Related papers (2021-10-08T11:19:06Z) - Space Meets Time: Local Spacetime Neural Network For Traffic Flow Forecasting [11.495992519252585]
We argue that such correlations are universal and play a pivotal role in traffic flow.
We propose a new spacetime interval learning framework that constructs a local-spacetime context of a traffic sensor.
The proposed STNN model can be applied on any unseen traffic networks.
arXiv Detail & Related papers (2021-09-11T09:04:35Z) - MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively.
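The pattern of generating per-sample filters from a feature map and applying them as a dynamic convolution can be sketched as below. This is a toy sketch, not MFGNet's architecture: the filter-generation "networks" are reduced to fixed random projections, and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_filters(feat, proj):
    """Toy filter-generation 'network': global-average-pool a (C, H, W)
    feature map, then project the (C,) descriptor to a per-sample
    1x1 convolution kernel of shape (out_ch, C)."""
    pooled = feat.mean(axis=(1, 2))                        # (C,)
    return (proj @ pooled).reshape(-1, feat.shape[0])      # (out_ch, C)

def dynamic_conv1x1(feat, kernel):
    """Apply the generated 1x1 kernel as a dynamic convolution."""
    c, h, w = feat.shape
    return (kernel @ feat.reshape(c, -1)).reshape(-1, h, w)

C, H, W, OUT = 8, 6, 6, 4
visible = rng.normal(size=(C, H, W))
thermal = rng.normal(size=(C, H, W))
proj_v = rng.normal(scale=0.1, size=(OUT * C, C))  # stand-in "network" for visible
proj_t = rng.normal(scale=0.1, size=(OUT * C, C))  # stand-in "network" for thermal

# Two independent generators; each modality's filter is applied to its
# corresponding feature map.
out_v = dynamic_conv1x1(visible, generate_filters(visible, proj_v))
out_t = dynamic_conv1x1(thermal, generate_filters(thermal, proj_t))
print(out_v.shape, out_t.shape)  # (4, 6, 6) (4, 6, 6)
```

The key property is that the convolution kernel is a function of the input sample rather than a fixed learned weight, which lets the filtering adapt to each modality's content.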
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z) - ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
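Channel-wise attention across parallel branches, the core of split attention, can be sketched as below. This is a simplified sketch assuming global-average-pooled branch descriptors feed the softmax directly; ResNeSt's actual block includes learned fully-connected layers that are omitted here.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(branches):
    """Fuse parallel branches with per-channel attention weights.

    branches: (B, C, H, W) array of B branch outputs. Each branch is
    global-average-pooled to a (C,) descriptor; a softmax across the
    branch axis yields per-channel weights that sum to 1 over branches,
    and the branches are fused as a weighted sum."""
    pooled = branches.mean(axis=(2, 3))    # (B, C) branch descriptors
    weights = softmax(pooled, axis=0)      # (B, C), softmax over branches
    return np.einsum("bc,bchw->chw", weights, branches)

x = np.random.default_rng(1).normal(size=(2, 8, 4, 4))
fused = split_attention(x)
print(fused.shape)  # (8, 4, 4)
```

Each channel thus chooses its own mixture of branches, which is how the block captures cross-feature interactions without concatenating branch outputs.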
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.