PiSLTRc: Position-informed Sign Language Transformer with Content-aware
Convolution
- URL: http://arxiv.org/abs/2107.12600v1
- Date: Tue, 27 Jul 2021 05:01:27 GMT
- Title: PiSLTRc: Position-informed Sign Language Transformer with Content-aware
Convolution
- Authors: Pan Xie and Mengyi Zhao and Xiaohui Hu
- Abstract summary: We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
- Score: 0.42970700836450487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Owing to the Transformer's superiority in learning long-term dependencies, sign
language Transformer models have achieved remarkable progress in Sign Language
Recognition (SLR) and Translation (SLT). However, several issues prevent the
Transformer from better sign language understanding. First, the self-attention
mechanism learns sign video representations in a frame-wise manner, neglecting
the temporal semantic structure of sign gestures. Second, the attention
mechanism with absolute position encoding is unaware of direction and distance,
which limits its ability.
To address these issues, we propose a new model architecture, namely PiSLTRc,
with two distinctive characteristics: (i) content-aware and position-aware
convolution layers. Specifically, we explicitly select relevant features using
a novel content-aware neighborhood gathering method. Then we aggregate these
features with position-informed temporal convolution layers, thus generating
robust neighborhood-enhanced sign representations; and (ii) injecting relative
position information into the attention mechanism in the encoder, the decoder,
and even the encoder-decoder cross-attention. Compared with the vanilla Transformer
model, our model performs consistently better on three large-scale sign
language benchmarks: PHOENIX-2014, PHOENIX-2014-T and CSL. Furthermore,
extensive experiments demonstrate that the proposed method achieves
state-of-the-art translation quality, with an improvement of $+1.6$ BLEU.
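The abstract describes characteristic (i) only in prose, so the following is a minimal PyTorch sketch of the general idea, written as an illustration rather than the authors' implementation: for each frame, the most content-similar frames inside a local window are gathered, tagged with an embedding of their signed temporal offset, and aggregated by a temporal convolution. The class name, window size, top-k selection rule, and all hyperparameters are assumptions made for this example.

    # Minimal PyTorch sketch (illustrative, not the paper's code): content-aware
    # neighborhood gathering followed by a position-informed temporal convolution.
    import torch
    import torch.nn as nn

    class NeighborhoodEnhancedBlock(nn.Module):
        def __init__(self, dim: int, k: int = 9, window: int = 16):
            super().__init__()
            assert k <= window + 1, "need enough in-window candidates per frame"
            self.k, self.window = k, window
            # embedding for signed temporal offsets in [-window, window]
            self.rel_pos = nn.Embedding(2 * window + 1, dim)
            # temporal convolution aggregating the k gathered neighbours
            self.conv = nn.Conv1d(dim, dim, kernel_size=k)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, dim) frame-wise sign video features; assumes time >= k
            b, t, d = x.shape
            sim = x @ x.transpose(1, 2) / d ** 0.5                  # (b, t, t)
            idx = torch.arange(t, device=x.device)
            dist = (idx[None, :] - idx[:, None]).abs()              # (t, t)
            sim = sim.masked_fill(dist[None] > self.window, float("-inf"))
            # content-aware gathering: k most similar in-window frames per query,
            # re-sorted into temporal order so the convolution stays "temporal"
            nbr_idx = sim.topk(self.k, dim=-1).indices.sort(dim=-1).values  # (b, t, k)
            neigh = torch.gather(
                x.unsqueeze(1).expand(b, t, t, d), 2,
                nbr_idx.unsqueeze(-1).expand(-1, -1, -1, d))        # (b, t, k, d)
            # position-informed: add an embedding of each neighbour's signed offset
            offset = nbr_idx - idx[None, :, None]
            neigh = neigh + self.rel_pos(offset + self.window)
            # aggregate the neighbourhood into one enhanced feature per frame
            out = self.conv(neigh.reshape(b * t, self.k, d).transpose(1, 2))
            return out.squeeze(-1).reshape(b, t, d)

    # Example: enhance 64 frame features of width 512
    # feats = torch.randn(2, 64, 512)
    # enhanced = NeighborhoodEnhancedBlock(dim=512)(feats)   # -> (2, 64, 512)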
Related papers
- Linguistically Motivated Sign Language Segmentation [51.06873383204105]
We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
arXiv Detail & Related papers (2023-10-21T10:09:34Z)
- Online Gesture Recognition using Transformer and Natural Language Processing [0.0]
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
arXiv Detail & Related papers (2023-05-05T10:17:22Z)
- Two-Stream Network for Sign Language Recognition and Translation [38.43767031555092]
We introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences.
The resulting model is called TwoStream-SLR, which is competent for sign language recognition.
TwoStream-SLR is extended to a sign language translation model, TwoStream-SLT, by simply attaching an extra translation network.
arXiv Detail & Related papers (2022-11-02T17:59:58Z)
- Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [8.944233327731245]
This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT could often outperform current state-of-the-art image captioning models.
arXiv Detail & Related papers (2021-10-01T11:57:50Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data (a minimal sketch of this kind of relative position bias follows the list below).
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
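Both the abstract's point (ii) above and the "Relative Positional Encoding for Speech Recognition and Direct Translation" entry rely on making attention aware of the signed distance between positions. The sketch below shows one common way to do this, a learned bias per head indexed by the clipped signed query-key offset, offered only as an illustration under assumed names and hyperparameters; it does not reproduce either paper's exact scheme.

    # Minimal PyTorch sketch (illustrative): self-attention whose logits receive a
    # learned bias indexed by the clipped, signed query-key offset, making the
    # mechanism distance- and direction-aware, unlike absolute position encoding.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelativeBiasSelfAttention(nn.Module):
        def __init__(self, dim: int, heads: int = 8, max_dist: int = 64):
            super().__init__()
            assert dim % heads == 0
            self.heads, self.max_dist = heads, max_dist
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)
            # one scalar bias per head for every clipped signed offset
            self.rel_bias = nn.Embedding(2 * max_dist + 1, heads)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, d = x.shape
            h, hd = self.heads, d // self.heads
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (z.view(b, t, h, hd).transpose(1, 2) for z in (q, k, v))
            logits = q @ k.transpose(-2, -1) / hd ** 0.5            # (b, h, t, t)
            # signed offsets between query and key positions, clipped to max_dist
            pos = torch.arange(t, device=x.device)
            offset = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
            bias = self.rel_bias(offset + self.max_dist)            # (t, t, heads)
            logits = logits + bias.permute(2, 0, 1)                 # broadcast over batch
            attn = F.softmax(logits, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, t, d)
            return self.out(out)

    # Example: frames = torch.randn(2, 64, 512)
    # out = RelativeBiasSelfAttention(dim=512)(frames)   # -> (2, 64, 512)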
This list is automatically generated from the titles and abstracts of the papers on this site.