SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning
- URL: http://arxiv.org/abs/2401.11847v1
- Date: Mon, 22 Jan 2024 11:04:55 GMT
- Title: SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning
- Authors: Hao Chen, Jiaze Wang, Ziyu Guo, Jinpeng Li, Donghao Zhou, Bian Wu,
Chenyong Guan, Guangyong Chen, Pheng-Ann Heng
- Abstract summary: SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
- Score: 51.800031281177105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language recognition (SLR) plays a vital role in facilitating
communication for the hearing-impaired community. SLR is a weakly supervised
task where entire videos are annotated with glosses, making it challenging to
identify the corresponding gloss within a video segment. Recent studies
indicate that the main bottleneck in SLR is the insufficient training caused by
the limited availability of large-scale datasets. To address this challenge, we
present SignVTCL, a multi-modal continuous sign language recognition framework
enhanced by visual-textual contrastive learning, which leverages the full
potential of multi-modal data and the generalization ability of language model.
SignVTCL integrates multi-modal data (video, keypoints, and optical flow)
simultaneously to train a unified visual backbone, thereby yielding more robust
visual representations. Furthermore, SignVTCL contains a visual-textual
alignment approach incorporating gloss-level and sentence-level alignment to
ensure precise correspondence between visual features and glosses at the level
of individual glosses and sentence. Experimental results conducted on three
datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL
achieves state-of-the-art results compared with previous methods.
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder.
To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework.
Our method achieves substantial performance gains on various downstream tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition ( SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z) - Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language
Recognition with Variational Alignment [42.10603331311837]
Sign language recognition ( SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual transformation for SLR,- SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.