CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language
Recognition with Variational Alignment
- URL: http://arxiv.org/abs/2303.05725v4
- Date: Wed, 12 Apr 2023 10:07:11 GMT
- Title: CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language
Recognition with Variational Alignment
- Authors: Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia,
Yidong Chen, Stan Z. Li
- Abstract summary: Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
- Score: 42.10603331311837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language recognition (SLR) is a weakly supervised task that annotates
sign videos as textual glosses. Recent studies show that insufficient training
caused by the lack of large-scale available sign datasets becomes the main
bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and
develop two mainstream solutions. The multi-stream architectures extend
multi-cue visual features, yielding the current SOTA performance, but they require
complex designs and might introduce potential noise. Alternatively, the
advanced single-cue SLR frameworks that use explicit cross-modal alignment between
the visual and textual modalities are simple and effective, and potentially competitive
with the multi-cue frameworks. In this work, we propose a novel contrastive
visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained
knowledge of both the visual and language modalities. Based on the single-cue
cross-modal alignment framework, we adopt a variational autoencoder (VAE) to
introduce the complete pretrained language module and its pretrained contextual
knowledge. The VAE implicitly aligns the visual and textual modalities while
benefiting from that contextual knowledge, as a traditional contextual module
would. Meanwhile, a contrastive cross-modal alignment algorithm is designed to
explicitly enhance the consistency constraints. Extensive experiments on public
datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR
consistently outperforms existing single-cue methods and even outperforms SOTA
multi-cue methods.
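The abstract describes two coupled mechanisms: implicit alignment through a VAE and an explicit contrastive cross-modal alignment. The sketch below illustrates only the explicit contrastive part and is not the authors' implementation; the module names, projection sizes, and temperature are assumptions, and the pooled per-video and per-gloss features stand in for outputs of the pretrained visual and language modules.

    # Minimal sketch (not the authors' code) of a contrastive visual-textual
    # alignment loss: pooled visual and gloss-text features are projected into
    # a shared space and pulled together with a symmetric InfoNCE objective.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAligner(nn.Module):
        def __init__(self, visual_dim=512, text_dim=768, embed_dim=256, temperature=0.07):
            super().__init__()
            self.visual_proj = nn.Linear(visual_dim, embed_dim)   # projects video features
            self.text_proj = nn.Linear(text_dim, embed_dim)       # projects gloss features
            self.temperature = temperature

        def forward(self, visual_feats, text_feats):
            # visual_feats: (batch, visual_dim) pooled per-video features
            # text_feats:   (batch, text_dim)   pooled per-gloss-sequence features
            v = F.normalize(self.visual_proj(visual_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            logits = v @ t.t() / self.temperature          # pairwise similarities
            targets = torch.arange(v.size(0), device=v.device)
            # Symmetric contrastive loss: matched video/gloss pairs are positives.
            loss_v2t = F.cross_entropy(logits, targets)
            loss_t2v = F.cross_entropy(logits.t(), targets)
            return 0.5 * (loss_v2t + loss_t2v)

    # Toy usage with random features standing in for backbone outputs.
    aligner = CrossModalAligner()
    loss = aligner(torch.randn(4, 512), torch.randn(4, 768))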
Related papers
- Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension [21.500920290909843]
We propose a new pretraining paradigm for Large Language Models (LLMs) to enhance their visual comprehension capabilities.
Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens.
We present a new foundation model called Croc, which achieves new state-of-the-art performance on massive vision-language benchmarks.
arXiv Detail & Related papers (2024-10-18T09:44:25Z)
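As a rough, hypothetical sketch of the token-matching step mentioned in the Croc entry above (not the paper's code), the snippet below pairs visual tokens with a prompt pool via the Hungarian algorithm and swaps the most similar matches; the shapes, similarity measure, and replacement ratio are assumptions.

    # Hedged sketch: match a learnable prompt-token pool to visual tokens with
    # the Hungarian algorithm and replace the best-matched visual tokens.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hungarian_token_replace(visual_tokens, prompt_pool, replace_ratio=0.25):
        # visual_tokens: (num_visual, dim), prompt_pool: (num_prompts, dim)
        v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
        p = prompt_pool / np.linalg.norm(prompt_pool, axis=1, keepdims=True)
        cost = -(v @ p.T)                      # negate to maximize cosine similarity
        rows, cols = linear_sum_assignment(cost)
        # Keep only the most similar matched pairs, up to the replacement budget.
        order = np.argsort(cost[rows, cols])   # most similar first
        k = int(replace_ratio * len(visual_tokens))
        out = visual_tokens.copy()
        for r, c in zip(rows[order][:k], cols[order][:k]):
            out[r] = prompt_pool[c]            # swap visual token for prompt token
        return out

    tokens = hungarian_token_replace(np.random.randn(16, 64), np.random.randn(32, 64))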
- Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation [5.528860524494717]
This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation.
By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved.
arXiv Detail & Related papers (2024-10-04T04:59:50Z)
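The gloss-based annotation idea in the BGTAI entry above can be illustrated with a toy example (purely our own, not from the paper): a sentence is reduced to content-word glosses, which can then be scored against image-level labels. The stopword list and overlap score are illustrative assumptions.

    # Toy illustration of gloss-style simplification: keep content words,
    # uppercase them in the style of sign-language glosses, and score overlap
    # with image labels.
    STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "to", "of", "and", "in", "on"}

    def to_gloss(sentence):
        return [w.upper().strip(".,!?") for w in sentence.lower().split() if w not in STOPWORDS]

    def gloss_image_overlap(sentence, image_labels):
        gloss = set(to_gloss(sentence))
        labels = {l.upper() for l in image_labels}
        return len(gloss & labels) / max(len(gloss), 1)   # simple overlap score

    print(to_gloss("The dog is running on the beach"))      # ['DOG', 'RUNNING', 'BEACH']
    print(gloss_image_overlap("The dog is running on the beach", ["dog", "beach", "sky"]))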
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
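A minimal sketch of the multi-stream fusion described in the SignVTCL entry above, assuming per-frame features from video, keypoints, and optical flow are projected to a shared width and summed before any visual-textual contrastive objective; all dimensions and the fusion rule are assumptions rather than the paper's actual design.

    # Hedged sketch: fuse three visual cues into one feature stream.
    import torch
    import torch.nn as nn

    class MultiCueFusion(nn.Module):
        def __init__(self, video_dim=512, keypoint_dim=128, flow_dim=256, fused_dim=512):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, fused_dim)
            self.keypoint_proj = nn.Linear(keypoint_dim, fused_dim)
            self.flow_proj = nn.Linear(flow_dim, fused_dim)
            self.mix = nn.Sequential(nn.LayerNorm(fused_dim), nn.Linear(fused_dim, fused_dim))

        def forward(self, video, keypoints, flow):
            # Each input: (batch, time, dim); summation keeps the temporal length.
            fused = self.video_proj(video) + self.keypoint_proj(keypoints) + self.flow_proj(flow)
            return self.mix(fused)   # (batch, time, fused_dim), fed to the recognition head

    fusion = MultiCueFusion()
    out = fusion(torch.randn(2, 80, 512), torch.randn(2, 80, 128), torch.randn(2, 80, 256))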
- Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
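As a hedged sketch of the multi-modal prompt idea in the MaPLe entry above, the snippet below prepends learnable prompt vectors to the language tokens and maps them through a small coupling layer into the vision branch; the prompt length, token widths, and coupling layer are assumptions, not the published configuration.

    # Hedged sketch: shared learnable prompts for both text and vision branches.
    import torch
    import torch.nn as nn

    class MultiModalPrompts(nn.Module):
        def __init__(self, n_prompts=4, text_dim=512, vision_dim=768):
            super().__init__()
            self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
            self.couple = nn.Linear(text_dim, vision_dim)   # ties vision prompts to text prompts

        def forward(self, text_tokens, vision_tokens):
            b = text_tokens.size(0)
            t_prompts = self.text_prompts.unsqueeze(0).expand(b, -1, -1)
            v_prompts = self.couple(self.text_prompts).unsqueeze(0).expand(b, -1, -1)
            # Prepend the prompts to each branch's token sequence.
            return (torch.cat([t_prompts, text_tokens], dim=1),
                    torch.cat([v_prompts, vision_tokens], dim=1))

    prompts = MultiModalPrompts()
    txt, vis = prompts(torch.randn(2, 16, 512), torch.randn(2, 50, 768))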
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
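A minimal sketch of the modality-transition idea in the last entry above: an MLP maps pooled visual features into a sentence-embedding space, and a simple cosine-based loss stands in for the proposed modality loss. The dimensions and the exact loss form are assumptions for illustration.

    # Hedged sketch: transfer visual features into a semantic (caption) space
    # before the language model, with a cosine-distance stand-in for the
    # modality loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityTransition(nn.Module):
        def __init__(self, visual_dim=2048, semantic_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(visual_dim, semantic_dim), nn.ReLU(),
                nn.Linear(semantic_dim, semantic_dim),
            )

        def forward(self, visual_feat):
            return self.net(visual_feat)   # semantic representation for the decoder

    def modality_loss(pred_semantic, target_sentence_emb):
        # 1 - cosine similarity, averaged over the batch.
        return (1 - F.cosine_similarity(pred_semantic, target_sentence_emb, dim=-1)).mean()

    mtm = ModalityTransition()
    loss = modality_loss(mtm(torch.randn(4, 2048)), torch.randn(4, 512))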
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.