Online Gesture Recognition using Transformer and Natural Language
Processing
- URL: http://arxiv.org/abs/2305.03407v1
- Date: Fri, 5 May 2023 10:17:22 GMT
- Title: Online Gesture Recognition using Transformer and Natural Language
Processing
- Authors: G.C.M. Silvestre, F. Balado, O. Akinremi and M. Ramo
- Abstract summary: Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture is shown to provide a powerful machine
transduction framework for online handwritten gestures corresponding to glyph
strokes of natural language sentences. The attention mechanism is successfully
used to create latent representations of an end-to-end encoder-decoder model,
solving multi-level segmentation while also learning some language features and
syntax rules. The additional use of a large decoding space with some learned
Byte-Pair-Encoding (BPE) is shown to provide robustness to ablated inputs and
syntax rules. The encoder stack was directly fed with spatio-temporal data
tokens potentially forming an infinitely large input vocabulary, an approach
that finds applications beyond that of this work. Encoder transfer-learning
capabilities are also demonstrated on several languages, resulting in faster
optimisation and shared parameters. A new supervised dataset of online
handwriting gestures suitable for generic handwriting recognition tasks was
used to successfully train a small transformer model to an average normalised
Levenshtein accuracy of 96% on English or German sentences and 94% on French sentences.
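The two ideas the abstract leans on can be made concrete with a short sketch: an encoder fed continuous spatio-temporal points through a linear projection (so the input vocabulary is effectively unbounded) and a decoder over a learned BPE vocabulary, evaluated with a normalised Levenshtein accuracy. The paper publishes no code here, so the (x, y, pen state, Δt) feature layout, model sizes, and all names below are assumptions.

```python
# Minimal sketch (PyTorch) of the paper's central idea: the encoder consumes
# continuous stroke samples via a learned linear projection rather than an
# embedding lookup, while the decoder predicts discrete BPE subwords.
# Dimensions and the (x, y, pen-state, dt) point layout are assumptions.
import torch
import torch.nn as nn

class GestureTransducer(nn.Module):
    def __init__(self, point_dim=4, d_model=128, bpe_vocab=8000,
                 nhead=4, num_layers=3):
        super().__init__()
        self.input_proj = nn.Linear(point_dim, d_model)    # continuous tokens in
        self.bpe_embed = nn.Embedding(bpe_vocab, d_model)  # discrete tokens out
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, bpe_vocab)

    def forward(self, points, bpe_prefix):
        # points: (batch, n_points, point_dim); bpe_prefix: (batch, n_tokens)
        # Positional encodings are omitted for brevity.
        src = self.input_proj(points)
        tgt = self.bpe_embed(bpe_prefix)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)  # logits over the BPE vocabulary

def normalised_levenshtein_accuracy(pred: str, ref: str) -> float:
    # 1 - edit_distance / len(ref): our reading of the reported metric.
    d = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (p != r))
    return 1.0 - d[len(ref)] / max(len(ref), 1)

model = GestureTransducer()
logits = model(torch.randn(2, 50, 4), torch.randint(0, 8000, (2, 10)))
print(logits.shape)                                  # torch.Size([2, 10, 8000])
print(normalised_levenshtein_accuracy("kitten", "sitting"))  # 1 - 3/7 = 0.571
```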
Related papers
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
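The dynamic encoding-length mechanism of DVA-VAE in the T2S-GPT entry above is specific to that paper, but the vector-quantization bottleneck it builds on can be sketched generically; everything below (names, sizes) is illustrative, not the paper's recipe.

```python
# Hypothetical sketch of a plain vector-quantization bottleneck, the building
# block behind DVA-VAE; the dynamic encoding-length part is not reproduced.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                      # z: (batch, seq, dim)
        # Distance of every input vector to every codebook vector.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        codes = dists.argmin(dim=-1)           # (batch, seq) discrete codes
        quantised = self.codebook(codes)
        # Straight-through estimator lets gradients bypass the argmin.
        quantised = z + (quantised - z).detach()
        return quantised, codes

vq = VectorQuantizer()
q, codes = vq(torch.randn(2, 10, 64))
print(q.shape, codes.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 10])
```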
- A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions [0.0]
Transformer architecture is shown to provide an end-to-end model for building expression trees from online handwritten gestures corresponding to glyph strokes.
The attention mechanism was successfully used to encode, learn and enforce the underlying syntax of expressions.
For the first time, the encoder is fed with spatio-temporal data tokens potentially forming an infinitely large vocabulary.
arXiv Detail & Related papers (2022-11-04T17:55:55Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by cross-modal encoders' success in visual-language tasks while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
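Cross-modal distillation of the kind XDBERT describes is often implemented by matching a student's hidden states to a frozen teacher's; the MSE objective and dimensions below are an assumed stand-in, not the paper's exact recipe.

```python
# Hypothetical sketch of cross-modal distillation: a language-only student is
# nudged toward the hidden states of a frozen multimodal teacher.
import torch
import torch.nn as nn

teacher_dim, student_dim = 768, 768
project = nn.Linear(student_dim, teacher_dim)  # aligns spaces if they differ
mse = nn.MSELoss()

def distillation_loss(student_hidden, teacher_hidden):
    # student_hidden, teacher_hidden: (batch, seq, dim)
    return mse(project(student_hidden), teacher_hidden.detach())

loss = distillation_loss(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(loss.item())
```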
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
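The bottleneck idea in the entry above can be sketched as pooling a frozen encoder's token states into one vector that conditions a shallow decoder, training only the pooling and the decoder; all components below are hypothetical simplifications.

```python
# Hypothetical sketch of a sentence-bottleneck autoencoder: a frozen encoder's
# token states are pooled into a single vector that a one-layer decoder
# attends to while reconstructing the input. Causal masking and input
# shifting are omitted for brevity.
import torch
import torch.nn as nn

d_model, vocab = 256, 1000

frozen_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in frozen_encoder.parameters():
    p.requires_grad = False                   # stands in for a pretrained LM

bottleneck = nn.Linear(d_model, d_model)      # trained: the sentence vector
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
embed = nn.Embedding(vocab, d_model)
out = nn.Linear(d_model, vocab)

def reconstruct_logits(tokens):               # tokens: (batch, seq)
    h = frozen_encoder(embed(tokens))
    sent = bottleneck(h.mean(dim=1, keepdim=True))   # (batch, 1, d_model)
    dec = decoder(embed(tokens), memory=sent)        # decode against 1 vector
    return out(dec)

print(reconstruct_logits(torch.randint(0, vocab, (2, 12))).shape)
# torch.Size([2, 12, 1000])
```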
- Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer [4.594159253008448]
We propose a multi-task learning-based transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder for the phoneme recognition task and a grapheme decoder to predict grapheme sequence.
Our proposed approach can obtain significant improvement over previous approaches.
arXiv Detail & Related papers (2021-08-22T09:32:15Z)
- A Dual-Decoder Conformer for Multilingual Speech Recognition [4.594159253008448]
This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict grapheme sequence along with language information.
Our experiments show that we can obtain a significant reduction in WER over the baseline approaches.
arXiv Detail & Related papers (2021-08-22T09:22:28Z)
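The two conformer entries above share one pattern: a shared acoustic encoder feeding a phoneme decoder and a grapheme decoder, trained jointly. The sketch below caricatures that pattern with frame-level classification heads and an assumed loss weighting; the real models use full attention decoders.

```python
# Hypothetical sketch of the shared pattern in the two entries above: one
# acoustic encoder, separate phoneme (PHN-DEC) and grapheme (GRP-DEC) heads,
# optimised with a weighted sum of two cross-entropy losses.
import torch
import torch.nn as nn

d_model, n_phonemes, n_graphemes = 144, 50, 80
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
phn_head = nn.Linear(d_model, n_phonemes)     # stands in for PHN-DEC
grp_head = nn.Linear(d_model, n_graphemes)    # stands in for GRP-DEC
ce = nn.CrossEntropyLoss()

def multitask_loss(features, phn_targets, grp_targets, alpha=0.3):
    h = encoder(features)                     # (batch, frames, d_model)
    phn = phn_head(h).flatten(0, 1)
    grp = grp_head(h).flatten(0, 1)
    return alpha * ce(phn, phn_targets.flatten()) \
        + (1 - alpha) * ce(grp, grp_targets.flatten())

loss = multitask_loss(torch.randn(2, 20, 144),
                      torch.randint(0, 50, (2, 20)),
                      torch.randint(0, 80, (2, 20)))
print(loss.item())
```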
- Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation [127.54315184545796]
Speech translation (ST) aims to learn transformations from speech in the source language to the text in the target language.
We propose to improve the multitask ST model by utilizing word embedding as the intermediate.
arXiv Detail & Related papers (2020-05-21T14:22:35Z)
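One plausible reading of "word embedding as the intermediate" in the entry above is an auxiliary loss that regresses encoder states toward frozen word embeddings; the regression target and MSE choice below are assumptions for illustration.

```python
# Hypothetical sketch of word embeddings as a multitask intermediate: besides
# the translation loss, encoder states are pulled toward frozen word
# embeddings, providing the auxiliary signal the entry describes.
import torch
import torch.nn as nn

d_model, emb_dim, vocab = 256, 300, 1000
word_emb = nn.Embedding(vocab, emb_dim)       # pretrained and frozen in spirit
word_emb.weight.requires_grad = False
to_emb = nn.Linear(d_model, emb_dim)          # maps encoder states to that space

def embedding_intermediate_loss(encoder_states, word_ids):
    # encoder_states: (batch, seq, d_model); word_ids: (batch, seq)
    return nn.functional.mse_loss(to_emb(encoder_states), word_emb(word_ids))

aux = embedding_intermediate_loss(torch.randn(2, 8, 256),
                                  torch.randint(0, vocab, (2, 8)))
print(aux.item())  # would be added to the main translation loss
```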
- Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
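Joint recognition and translation of the kind the entry above describes is commonly trained as a CTC loss over gloss sequences on the encoder plus a cross-entropy translation loss on the decoder; the equal loss weighting and all shapes below are assumptions.

```python
# Hypothetical sketch of joint sign recognition and translation: CTC over
# encoder-side gloss predictions plus cross-entropy on decoder outputs.
import torch
import torch.nn as nn

n_gloss, n_words, d_model = 100, 1000, 256
gloss_head = nn.Linear(d_model, n_gloss)
ctc = nn.CTCLoss(blank=0)
ce = nn.CrossEntropyLoss()

def joint_loss(enc_states, dec_logits, gloss_targets, word_targets):
    # enc_states: (batch, frames, d_model); dec_logits: (batch, words, n_words)
    log_probs = gloss_head(enc_states).log_softmax(-1).transpose(0, 1)
    in_lens = torch.full((enc_states.size(0),), enc_states.size(1),
                         dtype=torch.long)
    tgt_lens = torch.full((gloss_targets.size(0),), gloss_targets.size(1),
                          dtype=torch.long)
    recognition = ctc(log_probs, gloss_targets, in_lens, tgt_lens)
    translation = ce(dec_logits.flatten(0, 1), word_targets.flatten())
    return recognition + translation

loss = joint_loss(torch.randn(2, 30, 256),
                  torch.randn(2, 7, 1000),
                  torch.randint(1, 100, (2, 5)),
                  torch.randint(0, 1000, (2, 7)))
print(loss.item())
```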
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
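BiDAN's training signal can be caricatured much like the dual-decoder sketch earlier: a shared encoder feeds one decoder that reconstructs the source and one that produces the target translation. The joint objective below, with its weighting, is a hypothetical illustration of that idea.

```python
# Hypothetical sketch of BiDAN's two-target objective: a source-reconstruction
# loss plus a translation loss, both computed from decoders over one shared
# encoder, pushing the encoder toward a language-independent space.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def bidan_loss(src_logits, tgt_logits, src_ids, tgt_ids, beta=0.5):
    # *_logits: (batch, seq, vocab) from the two decoders; *_ids: (batch, seq)
    reconstruction = ce(src_logits.flatten(0, 1), src_ids.flatten())
    translation = ce(tgt_logits.flatten(0, 1), tgt_ids.flatten())
    return translation + beta * reconstruction

loss = bidan_loss(torch.randn(2, 6, 500), torch.randn(2, 8, 700),
                  torch.randint(0, 500, (2, 6)), torch.randint(0, 700, (2, 8)))
print(loss.item())
```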
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.