A Transformer-Based Contrastive Learning Approach for Few-Shot Sign
Language Recognition
- URL: http://arxiv.org/abs/2204.02803v1
- Date: Tue, 5 Apr 2022 11:42:55 GMT
- Title: A Transformer-Based Contrastive Learning Approach for Few-Shot Sign
Language Recognition
- Authors: Silvan Ferreira, Esdras Costa, M\'arcio Dahia, Jampierre Rocha
- Abstract summary: We propose a novel Contrastive Transformer-based model, which is shown to learn rich representations from sequences of body keypoints.
Experiments showed that the model generalizes well and achieves competitive results for sign classes never seen during training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sign language recognition from sequences of monocular images or 2D poses is a
challenging field, not only because of the difficulty of inferring 3D information from
2D data, but also because of the temporal relationships within the sequences of
information. Additionally, the wide variety of signs and the constant need to
add new ones in production environments make it infeasible to use traditional
classification techniques. We propose a novel Contrastive Transformer-based
model, which is shown to learn rich representations from sequences of body
keypoints, allowing better comparison between vector embeddings. This allows us
to apply these techniques to one-shot or few-shot tasks, such as
classification and translation. The experiments showed that the model
generalizes well and achieves competitive results for sign classes never seen
during training.
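The abstract gives only a high-level description, so the following is a minimal sketch rather than the authors' implementation: a Transformer encoder over 2D body-keypoint sequences, a generic NT-Xent-style contrastive loss (an assumption; the paper's exact objective is not stated here), and few-shot classification by comparing query embeddings against class prototypes built from a small support set. All module names, dimensions, and hyperparameters below are illustrative.

```python
# Hedged sketch of a contrastive Transformer over body-keypoint sequences.
# Not the authors' code; architecture details and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeypointContrastiveTransformer(nn.Module):
    """Encodes a sequence of 2D body keypoints into a single unit-norm embedding."""

    def __init__(self, num_keypoints=54, d_model=256, nhead=8, num_layers=4, embed_dim=128):
        super().__init__()
        # Each frame is a flattened (x, y) vector of all keypoints.
        self.input_proj = nn.Linear(num_keypoints * 2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, embed_dim)

    def forward(self, x):
        # x: (batch, frames, num_keypoints * 2)
        h = self.input_proj(x)
        h = self.encoder(h)                      # (batch, frames, d_model)
        h = h.mean(dim=1)                        # temporal average pooling
        return F.normalize(self.head(h), dim=-1) # unit-norm embedding


def nt_xent_loss(z1, z2, temperature=0.1):
    """Standard NT-Xent contrastive loss between two views/samples of the same signs."""
    z = torch.cat([z1, z2], dim=0)               # (2B, D)
    sim = z @ z.t() / temperature                # cosine similarities (embeddings are normalized)
    sim.fill_diagonal_(float('-inf'))            # exclude self-similarity
    batch = z1.size(0)
    targets = torch.cat([torch.arange(batch, 2 * batch), torch.arange(0, batch)])
    return F.cross_entropy(sim, targets)


@torch.no_grad()
def few_shot_classify(model, support_x, support_y, query_x):
    """Nearest-prototype classification: compare query embeddings to class mean embeddings."""
    support_z = model(support_x)
    query_z = model(query_x)
    classes = support_y.unique()
    prototypes = torch.stack([support_z[support_y == c].mean(0) for c in classes])
    prototypes = F.normalize(prototypes, dim=-1)
    scores = query_z @ prototypes.t()            # cosine similarity to each class prototype
    return classes[scores.argmax(dim=1)]
```

Under this reading, one-shot recognition of a sign class never seen during training amounts to embedding a single support example per class and assigning each query to the nearest prototype by cosine similarity; no classifier retraining is needed when new signs are added.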
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z) - Purposer: Putting Human Motion Generation in Context [30.706219830149504]
We present a novel method to generate human motion to populate 3D indoor scenes.
It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds.
arXiv Detail & Related papers (2024-04-19T15:16:04Z) - Premonition: Using Generative Models to Preempt Future Data Changes in
Continual Learning [63.850451635362425]
Continual learning requires a model to adapt to ongoing changes in the data distribution.
We show that the combination of a large language model and an image generation model can similarly provide useful premonitions.
We find that the backbone of our pre-trained networks can learn representations useful for the downstream continual learning problem.
arXiv Detail & Related papers (2024-03-12T06:29:54Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - Sign Language Production with Latent Motion Transformer [2.184775414778289]
We develop a new method to make high-quality sign videos without using human poses as a middle step.
Our model works in two main parts: first, it learns from a generator and the video's hidden features, and next, it uses another model to understand the order of these hidden features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
arXiv Detail & Related papers (2023-12-20T10:53:06Z) - Grounded Language Acquisition From Object and Action Imagery [1.5566524830295307]
We explore the development of a private language for visual data representation.
For object recognition, a set of sketches produced by human participants from real imagery was used.
For action recognition, 2D trajectories were generated from 3D motion capture systems.
arXiv Detail & Related papers (2023-09-12T15:52:08Z) - DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET can often achieve state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z) - StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)