TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for
Sign Language Translation
- URL: http://arxiv.org/abs/2010.05468v1
- Date: Mon, 12 Oct 2020 05:58:09 GMT
- Title: TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for
Sign Language Translation
- Authors: Dongxu Li, Chenchen Xu, Xin Yu, Kaihao Zhang, Ben Swift, Hanna
Suominen, Hongdong Li
- Abstract summary: Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
- Score: 101.6042317204022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language translation (SLT) aims to interpret sign video sequences into
text-based natural language sentences. Sign videos consist of continuous
sequences of sign gestures with no clear boundaries in between. Existing SLT
models usually represent sign visual features in a frame-wise manner so as to
avoid explicitly segmenting the videos into isolated signs. However,
these methods neglect the temporal information of signs and lead to substantial
ambiguity in translation. In this paper, we explore the temporal semantic
structures of sign videos to learn more discriminative features. To this end, we
first present a novel sign video segment representation which takes into
account multiple temporal granularities, thus alleviating the need for accurate
video segmentation. Taking advantage of the proposed segment representation, we
develop a novel hierarchical sign video feature learning method via a temporal
semantic pyramid network, called TSPNet. Specifically, TSPNet introduces an
inter-scale attention to evaluate and enhance local semantic consistency of
sign segments and an intra-scale attention to resolve semantic ambiguity by
using non-local video context. Experiments show that our TSPNet outperforms the
state-of-the-art with significant improvements on the BLEU score (from 9.58 to
13.41) and ROUGE score (from 31.80 to 34.96) on the largest commonly-used SLT
dataset. Our implementation is available at
https://github.com/verashira/TSPNet.
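
To make the pyramid idea concrete, here is a minimal PyTorch sketch: segment features are mean-pooled at several window sizes, refined with intra-scale (non-local) attention, and fused with inter-scale attention. The window sizes, dimensions, and module names are illustrative assumptions, not the authors' exact configuration; see the repository above for the reference implementation.

```python
import torch
import torch.nn as nn


class TemporalSemanticPyramid(nn.Module):
    """Multi-granularity segment features with intra-scale (non-local, within
    one window size) and inter-scale (across window sizes) attention."""

    def __init__(self, feat_dim=512, num_heads=8, window_sizes=(8, 12, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        # One non-local self-attention layer per granularity (intra-scale).
        self.intra_scale = nn.ModuleList(
            nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        # Attention across granularities at each aligned position (inter-scale).
        self.inter_scale = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) clip features, e.g. from I3D.
        scales = []
        for w, attn in zip(self.window_sizes, self.intra_scale):
            # Segment representation: mean-pool sliding windows of size w, stride
            # w // 2, so no hard segmentation into isolated signs is needed.
            segs = frame_feats.unfold(1, w, w // 2).mean(dim=-1)  # (B, S_w, D)
            # Intra-scale attention: disambiguate segments with non-local context.
            segs, _ = attn(segs, segs, segs)
            scales.append(segs)
        # Truncate all scales to a common length so aligned positions interact.
        t = min(s.size(1) for s in scales)
        stacked = torch.stack([s[:, :t] for s in scales], dim=2)  # (B, T, K, D)
        b, _, k, d = stacked.shape
        flat = stacked.view(b * t, k, d)
        # Inter-scale attention: local semantic consistency across granularities.
        fused, _ = self.inter_scale(flat, flat, flat)
        return fused.mean(dim=1).view(b, t, d)  # (B, T, D), fed to the decoder


if __name__ == "__main__":
    feats = TemporalSemanticPyramid()(torch.randn(2, 64, 512))
    print(feats.shape)  # torch.Size([2, 7, 512])
```
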
Related papers
- Linguistically Motivated Sign Language Segmentation [51.06873383204105]
We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
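
To make the tagging change concrete, here is a toy Python example (ours, not the paper's) of how BIO tags recover sign boundaries that IO tags collapse when signing is continuous:

```python
# Toy illustration: two back-to-back signs, three frames each, with no pause
# in between. IO tagging labels all six frames "I" (in a sign), so the
# boundary between the signs is lost; BIO tagging marks each sign onset with
# "B", which keeps the two segments separable.
io_tags  = ["I", "I", "I", "I", "I", "I"]   # one unbroken run: boundary lost
bio_tags = ["B", "I", "I", "B", "I", "I"]   # "B" re-opens a segment at frame 3


def decode_bio(tags):
    """Group frame indices into segments, opening a new one at each 'B'."""
    segments, current = [], []
    for i, tag in enumerate(tags):
        if tag == "B":                # sign onset: close any open segment
            if current:
                segments.append(current)
            current = [i]
        elif tag == "I" and current:  # continuation of the open segment
            current.append(i)
        else:                         # "O" (or stray "I"): close the segment
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


print(decode_bio(bio_tags))  # [[0, 1, 2], [3, 4, 5]] -- both signs recovered
```
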
arXiv Detail & Related papers (2023-10-21T10:09:34Z)
- LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision [44.13777026011408]
We learn semantic video representations that capture rich spatial and temporal properties in video data by leveraging high-level logic specifications.
We evaluate our method on three datasets with rich spatial representations and temporal specifications: 20BN-Something-Something, MUGEN, and OpenPVSG.
arXiv Detail & Related papers (2023-04-15T22:24:05Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage [45.29710323525548]
We propose a transformer-based network, called SignLookup, to extract spatio-temporal representations from video clips.
Our model achieves state-of-the-art performance on the sign spotting task with accuracy as high as 96% on challenging benchmark datasets.
arXiv Detail & Related papers (2021-07-21T12:49:44Z)
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network within a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision.
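
For readers unfamiliar with the MIL setup, a minimal PyTorch sketch of such weakly supervised cross-modal alignment follows; the scoring head, max-pooling aggregation, and in-batch negatives are our illustrative assumptions, not WSTAN's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MILGrounding(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores one (segment, sentence) pair

    def forward(self, segments, query):
        # segments: (B, N, D) candidate segment features; query: (B, D) sentence.
        q = query.unsqueeze(1).expand_as(segments)
        logits = self.score(torch.cat([segments, q], dim=-1)).squeeze(-1)  # (B, N)
        # MIL aggregation: a video (bag) matches the sentence if its best
        # candidate segment (instance) does.
        return logits.max(dim=1).values, logits


def mil_loss(model, segments, query):
    """Video-level supervision only: matched video-sentence pairs are positive
    bags; queries rolled within the batch serve as negative bags."""
    pos, _ = model(segments, query)
    neg, _ = model(segments, query.roll(shifts=1, dims=0))
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```
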
arXiv Detail & Related papers (2021-06-30T15:42:08Z)
- Sign language segmentation with temporal convolutional networks [25.661006537351547]
Our approach employs 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues.
We demonstrate the effectiveness of our approach on the BSLCorpus, PHOENIX14 and BSL-1K datasets.
arXiv Detail & Related papers (2020-11-25T19:11:48Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet exploit the spatial and temporal long-range context interdependencies of such features, together with their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and strengthens WSLR models by transferring knowledge from subtitled sign language news to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)