Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign
Language Recognition
- URL: http://arxiv.org/abs/2107.12762v1
- Date: Tue, 27 Jul 2021 12:06:56 GMT
- Title: Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign
Language Recognition
- Authors: Pan Xie, Zhi Cui, Yao Du, Mengyi Zhao, Jianwei Cui, Bin Wang, Xiaohui
Hu
- Abstract summary: Continuous sign language recognition is a publicly significant task that transcribes a sign language video into an ordered gloss sequence.
One promising way is to adopt a one-dimensional convolutional network (1D-CNN) to temporally fuse the sequential frames.
We propose to adaptively fuse local features via temporal similarity for this task.
- Score: 4.059599144668737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuous sign language recognition (cSLR) is a publicly significant task that
transcribes a sign language video into an ordered gloss sequence. It is
important to capture the fine-grained gloss-level details, since there is no
explicit alignment between sign video frames and the corresponding glosses.
Among the past works, one promising way is to adopt a one-dimensional
convolutional network (1D-CNN) to temporally fuse the sequential frames.
However, CNNs are agnostic to similarity or dissimilarity, and thus are unable
to capture locally consistent semantics within temporally neighboring frames. To
address the issue, we propose to adaptively fuse local features via temporal
similarity for this task. Specifically, we devise a Multi-scale Local-Temporal
Similarity Fusion Network (mLTSF-Net) as follows: 1) In terms of a specific
video frame, we firstly select its similar neighbours with multi-scale
receptive regions to accommodate different lengths of glosses. 2) To ensure
temporal consistency, we then use position-aware convolution to temporally
convolve each scale of selected frames. 3) To obtain a local-temporally
enhanced frame-wise representation, we finally fuse the results of different
scales using a content-dependent aggregator. We train our model in an
end-to-end fashion, and the experimental results on RWTH-PHOENIX-Weather 2014
dataset (RWTH) demonstrate that our model achieves competitive performance
compared with several state-of-the-art models.
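The summary above does not include an implementation, so here is a minimal, hypothetical PyTorch-style sketch of the core idea: for each frame, select its most similar neighbours within windows of several scales, convolve each scale's selected frames, and fuse the scales with content-dependent weights. The class name LocalTemporalSimilarityFusion, the cosine-similarity selection, the plain Conv1d standing in for the paper's position-aware convolution, and the softmax gate standing in for the content-dependent aggregator are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalTemporalSimilarityFusion(nn.Module):
    """Hypothetical sketch: fuse each frame with its most similar temporal
    neighbours at several window scales, then aggregate the scales with
    content-dependent (softmax) weights. Not the authors' implementation."""

    def __init__(self, dim, scales=(5, 9, 13), top_k=3):
        super().__init__()
        self.scales = scales
        self.top_k = top_k
        # one 1D convolution per scale over the selected neighbours
        # (stands in for the paper's position-aware convolution)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=top_k) for _ in scales
        )
        # content-dependent aggregator: one weight per scale, predicted from the frame itself
        self.scale_gate = nn.Linear(dim, len(scales))

    def forward(self, x):                                   # x: (B, T, C) frame-wise features
        B, T, C = x.shape
        outputs = []
        for s, conv in zip(self.scales, self.convs):
            half = s // 2
            padded = F.pad(x, (0, 0, half, half))           # pad along the time axis
            windows = padded.unfold(1, s, 1)                # (B, T, C, s) sliding windows
            windows = windows.permute(0, 1, 3, 2)           # (B, T, s, C)
            # similarity between the centre frame and each neighbour in its window
            sim = F.cosine_similarity(windows, x.unsqueeze(2), dim=-1)  # (B, T, s)
            top = sim.topk(self.top_k, dim=-1).indices      # (B, T, k) most similar neighbours
            top = top.sort(dim=-1).values                   # restore temporal order
            selected = torch.gather(
                windows, 2, top.unsqueeze(-1).expand(-1, -1, -1, C)
            )                                               # (B, T, k, C)
            # convolve the k selected frames at every position down to one vector
            fused = conv(selected.reshape(B * T, self.top_k, C).transpose(1, 2))
            outputs.append(fused.squeeze(-1).reshape(B, T, C))
        # content-dependent aggregation across scales
        weights = self.scale_gate(x).softmax(dim=-1)        # (B, T, num_scales)
        stacked = torch.stack(outputs, dim=-1)              # (B, T, C, num_scales)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1) # (B, T, C)
```

In a typical cSLR pipeline such a module would sit between the frame-wise visual encoder and the CTC-based sequence head, taking a (batch, time, channels) feature tensor and returning an enhanced tensor of the same shape.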
Related papers
- FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals
in Factorized Orthogonal Latent Space [7.324708513042455]
This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals.
It consistently outperforms the state-of-the-art baselines in downstream tasks by a clear margin.
arXiv Detail & Related papers (2023-10-30T22:55:29Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild [19.5702895176141]
We propose a method for capturing discriminative features within each frame and modeling relationships among frames.
We utilize the CNN to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
arXiv Detail & Related papers (2022-05-10T08:47:15Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification [86.73532136686438]
We present an efficient spatial-temporal representation for video person re-identification (reID).
We propose a Bilateral Complementary Network (BiCnet) for spatial complementarity modeling.
BiCnet-TKS outperforms state-of-the-art methods with about 50% less computation.
arXiv Detail & Related papers (2021-04-30T06:44:34Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)