Self-Supervised Video Transformers for Isolated Sign Language Recognition
- URL: http://arxiv.org/abs/2309.02450v1
- Date: Sat, 2 Sep 2023 03:00:03 GMT
- Title: Self-Supervised Video Transformers for Isolated Sign Language Recognition
- Authors: Marcelo Sandoval-Castaneda, Yanhong Li, Diane Brentari, Karen Livescu, Gregory Shakhnarovich
- Abstract summary: We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes.
MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000.
- Score: 19.72944125318495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an in-depth analysis of various self-supervision methods
for isolated sign language recognition (ISLR). We consider four recently
introduced transformer-based approaches to self-supervised learning from
videos, and four pre-training data regimes, and study all the combinations on
the WLASL2000 dataset. Our findings reveal that MaskFeat achieves performance
superior to pose-based and supervised video models, with a top-1 accuracy of
79.02% on gloss-based WLASL2000. Furthermore, we analyze these models' ability
to produce representations of ASL signs using linear probing on diverse
phonological features. This study underscores the value of architecture and
pre-training task choices in ISLR. Specifically, our results on WLASL2000
highlight the power of masked reconstruction pre-training, and our linear
probing results demonstrate the importance of hierarchical vision transformers
for sign language representation.
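To make these two ingredients concrete, below are two hedged sketches. First, a toy version of masked feature prediction in the spirit of MaskFeat: a fraction of a clip's patch tokens are masked, the sequence is encoded, and per-patch features (HOG in the actual method; random stand-ins here) are regressed on the masked positions only. Shapes, the two-layer encoder, and the target dimensionality are illustrative assumptions, not the paper's implementation.

```python
# Toy masked-feature-prediction sketch in the spirit of MaskFeat
# (illustrative stand-ins; the real method regresses HOG features of
# masked space-time patches with a hierarchical video transformer).
import torch
import torch.nn as nn

B, T, D = 4, 196, 768              # batch, tokens per clip, embed dim (assumed)
tokens = torch.randn(B, T, D)      # stand-in patch embeddings of a video clip
targets = torch.randn(B, T, 108)   # stand-in for per-patch HOG targets

mask = torch.rand(B, T) < 0.4      # mask 40% of the tokens
mask_token = nn.Parameter(torch.zeros(D))
x = torch.where(mask.unsqueeze(-1), mask_token.expand(B, T, D), tokens)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2)
head = nn.Linear(D, 108)           # predicts features of masked patches

pred = head(encoder(x))
loss = ((pred - targets)[mask] ** 2).mean()  # loss only on masked tokens
print(loss.item())
```

Second, the phonological analysis uses linear probing: freeze the pre-trained encoder, extract one embedding per sign video, and fit a single linear classifier per phonological feature, so probe accuracy measures how linearly decodable that feature is. A minimal sketch with placeholder data (dimensions, split sizes, and the number of feature categories are assumptions):

```python
# Minimal linear-probing sketch (illustrative, not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholder data: pooled 768-dim embeddings from a frozen encoder,
# labels for one phonological feature (e.g. handshape) with 10 classes.
X_train, y_train = rng.normal(size=(1600, 768)), rng.integers(0, 10, 1600)
X_test, y_test = rng.normal(size=(400, 768)), rng.integers(0, 10, 400)

probe = LogisticRegression(max_iter=1000)  # the encoder stays frozen
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```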
Related papers
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features [48.11426546401525]
We introduce SigLIP 2, a family of new multilingual vision-language encoders.
We extend the original image-text training objective with several prior, independently developed techniques into a unified recipe.
The new training recipe leads to significant improvements on localization and dense prediction tasks.
arXiv Detail & Related papers (2025-02-20T18:08:29Z)
- Training Strategies for Isolated Sign Language Recognition [72.27323884094953]
This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition.
The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds.
We achieve a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution.
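One plausible ingredient for the varying-sign-speed challenge mentioned above is temporal resampling; the sketch below is a hedged illustration, not the paper's actual augmentation set (which is not detailed in this summary).

```python
# Hedged sketch: temporal resampling as a video augmentation for
# varying sign speeds (one plausible technique, assumed for illustration).
import numpy as np

def resample_clip(frames: np.ndarray, speed: float) -> np.ndarray:
    """Simulate a faster or slower signer by resampling frame indices.

    frames: array of shape (T, H, W, C); speed > 1 plays faster.
    """
    T = frames.shape[0]
    idx = np.clip(np.round(np.arange(T) * speed), 0, T - 1).astype(int)
    return frames[idx]  # same length; frames are skipped or repeated

clip = np.zeros((32, 224, 224, 3), dtype=np.uint8)  # dummy clip
fast = resample_clip(clip, speed=1.5)   # skips frames
slow = resample_clip(clip, speed=0.75)  # repeats frames
```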
arXiv Detail & Related papers (2024-12-16T08:37:58Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
Training uses isolated sign videos, in which hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors [9.279391026742658]
We analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor.
We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations.
arXiv Detail & Related papers (2023-11-27T15:58:28Z)
- Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables [2.048226951354646]
We study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models.
We also investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model.
Our findings show that rich feature representations from SSL models and improved geometric transformations with target TVs substantially improve the performance of SI systems.
arXiv Detail & Related papers (2023-09-17T09:18:04Z)
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
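Schematically, this view treats the demonstrations as evidence about a latent task variable that the model implicitly marginalizes over; the notation below is an illustrative reconstruction, not copied from the paper:

```latex
% Latent-variable view of in-context learning (illustrative notation):
% the predictive distribution marginalizes a latent task variable \theta
% inferred from the demonstration set \mathcal{D}.
P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta)\, P(\theta \mid \mathcal{D})\, d\theta
```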
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism, built on CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification (CTC) to adapt them to the English pronunciation of English-as-a-second-language (ESL) learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
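As a rough illustration of CTC-based adaptation of an SSL encoder, the sketch below attaches a CTC head to stand-in frame features; the projection size, label vocabulary, and sequence lengths are assumptions, and real SSL features would come from a pre-trained model.

```python
# Hedged sketch: a CTC fine-tuning head over SSL speech features
# (stand-in tensors; not the paper's implementation).
import torch
import torch.nn as nn

B, T, D, V = 2, 100, 768, 32           # batch, frames, feat dim, vocab (+blank)
feats = torch.randn(B, T, D)           # stand-in for SSL frame features
head = nn.Linear(D, V)                 # projection over phone labels

log_probs = head(feats).log_softmax(-1).transpose(0, 1)  # (T, B, V)
targets = torch.randint(1, V, (B, 12))                   # label ids; 0 = blank
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           torch.full((B,), T), torch.full((B,), 12))
loss.backward()  # gradients adapt the head (and encoder, when unfrozen)
```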
arXiv Detail & Related papers (2022-04-08T06:13:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.