Continuous Sign Language Recognition with Correlation Network
- URL: http://arxiv.org/abs/2303.03202v3
- Date: Sat, 18 Mar 2023 12:31:42 GMT
- Title: Continuous Sign Language Recognition with Correlation Network
- Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Wei Feng
- Abstract summary: We propose correlation network (CorrNet) to explicitly capture and leverage body trajectories across frames to identify signs.
CorrNet achieves new state-of-the-art accuracy on four large-scale datasets.
- Score: 6.428695655854854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human body trajectories are a salient cue to identify actions in the video.
Such body trajectories are mainly conveyed by hands and face across consecutive
frames in sign language. However, current methods in continuous sign language
recognition (CSLR) usually process frames independently, thus failing to
capture cross-frame trajectories to effectively identify a sign. To handle this
limitation, we propose correlation network (CorrNet) to explicitly capture and
leverage body trajectories across frames to identify signs. Specifically, a
correlation module is first proposed to dynamically compute correlation maps
between the current frame and adjacent frames to identify trajectories of all
spatial patches. An identification module is then presented to dynamically
emphasize the body trajectories within these correlation maps. As a result, the
generated features are able to gain an overview of local temporal movements to
identify a sign. Thanks to its special attention to body trajectories, CorrNet
achieves new state-of-the-art accuracy on four large-scale datasets, i.e.,
PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison with
previous spatial-temporal reasoning methods verifies the effectiveness of
CorrNet. Visualizations demonstrate the effects of CorrNet on emphasizing human
body trajectories across adjacent frames.
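The correlation-module idea can be illustrated with a toy sketch (an illustrative assumption, not the authors' implementation): for each spatial patch feature in the current frame, compute a dot-product similarity with every patch of an adjacent frame, yielding one correlation map per patch whose peaks indicate where that patch moved.

```python
def correlation_maps(curr, adjacent):
    """Compute toy correlation maps between two frames.

    curr, adjacent: lists of patch feature vectors (lists of floats).
    Returns maps where maps[i][j] = dot(curr[i], adjacent[j]), i.e. the
    similarity of patch i in the current frame to patch j in the adjacent
    frame.
    """
    return [[sum(c * a for c, a in zip(cp, ap)) for ap in adjacent]
            for cp in curr]

# Toy example: 2 patches per frame, 3-dimensional features.
curr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
nxt = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
maps = correlation_maps(curr, nxt)
print(maps)  # [[1.0, 0.5], [0.0, 0.5]]
```

Here the map for the first patch peaks at patch 0 of the next frame (the patch stayed put); a peak elsewhere would indicate movement toward that location, which is the cross-frame trajectory cue CorrNet's identification module then emphasizes.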
Related papers
- Local All-Pair Correspondence for Point Tracking [59.76186266230608]
We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences.
LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
arXiv Detail & Related papers (2024-07-22T06:49:56Z) - CorrNet+: Sign Language Recognition and Translation via Spatial-Temporal Correlation [16.961613400566474]
This paper introduces a spatial-temporal correlation network, denoted as CorrNet+, which explicitly identifies body trajectories across multiple frames.
As a unified model, CorrNet+ achieves new state-of-the-art performance on two extensive sign language understanding tasks.
arXiv Detail & Related papers (2024-04-17T06:57:57Z) - TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions [10.954210339694841]
A key challenge in continuous sign language recognition (CSLR) is efficiently capturing long-range spatial interactions over time from the input video.
We propose TCNet, a hybrid network that effectively models video information from trajectories and correlated regions.
We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL.
arXiv Detail & Related papers (2024-03-18T14:20:17Z) - Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z) - HAGCN : Network Decentralization Attention Based Heterogeneity-Aware
Spatiotemporal Graph Convolution Network for Traffic Signal Forecasting [0.0]
We study the heterogeneous characteristics inherent in traffic signal data to learn hidden relationships between sensors in various ways.
We propose a network decentralization attention-aware graph convolution network (HAGCN) method that aggregates the hidden states of adjacent nodes.
arXiv Detail & Related papers (2022-09-05T13:45:52Z) - DMGCRN: Dynamic Multi-Graph Convolution Recurrent Network for Traffic
Forecasting [7.232141271583618]
We propose a novel dynamic multi-graph convolution recurrent network (DMGCRN) to tackle the above issues.
We use the distance-based graph to capture spatial information from nodes that are close in distance.
We also construct a novel latent graph, which encodes the structural correlations among roads, to capture spatial information from nodes that are similar in structure.
arXiv Detail & Related papers (2021-12-04T06:51:55Z) - Modelling Neighbor Relation in Joint Space-Time Graph for Video
Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Sign language segmentation with temporal convolutional networks [25.661006537351547]
Our approach employs 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues.
We demonstrate the effectiveness of our approach on the BSLCORPUS, PHOENIX14 and BSL-1K datasets.
arXiv Detail & Related papers (2020-11-25T19:11:48Z) - Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-temporal memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the long-range spatial and temporal context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.