Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition
- URL: http://arxiv.org/abs/2505.02304v2
- Date: Tue, 22 Jul 2025 00:46:21 GMT
- Title: Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition
- Authors: Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao
- Abstract summary: We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (97.1% accuracy) and Turkish AUTSL (97.07% accuracy) datasets.
- Score: 9.044039469025009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (97.1% accuracy) and Turkish AUTSL (97.07% accuracy) datasets. The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.
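The alignment objective described above, matching each skeleton embedding against all of its positive text descriptions (global, synonym, and part level) and optimizing a KL divergence between the predicted and target matching distributions, can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumptions, not the authors' implementation: the tensor names, the 0.07 temperature, and the uniform target over positives are assumptions, and the RAG-based generation of the descriptions themselves is omitted.

```python
import torch
import torch.nn.functional as F


def multi_positive_kl_loss(skel_emb, text_emb, pos_mask, temperature=0.07):
    """Multi-positive contrastive alignment via KL divergence (sketch).

    skel_emb : (B, D) skeleton-encoder outputs, one per sign sample
    text_emb : (T, D) text-encoder outputs for all candidate descriptions
    pos_mask : (B, T) binary mask, 1 where description j describes sample i
    """
    skel_emb = F.normalize(skel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Predicted matching distribution over all candidate descriptions.
    logits = skel_emb @ text_emb.t() / temperature            # (B, T)
    log_pred = F.log_softmax(logits, dim=-1)

    # Target distribution: uniform probability mass over every positive.
    target = pos_mask / pos_mask.sum(dim=-1, keepdim=True).clamp(min=1)

    # KL(target || prediction), skeleton -> text direction.
    loss_s2t = F.kl_div(log_pred, target, reduction="batchmean")

    # Symmetric text -> skeleton direction for bidirectional alignment;
    # description rows with no positive sample in the batch contribute zero.
    log_pred_t = F.log_softmax(logits.t(), dim=-1)            # (T, B)
    target_t = pos_mask.t() / pos_mask.t().sum(dim=-1, keepdim=True).clamp(min=1)
    loss_t2s = F.kl_div(log_pred_t, target_t, reduction="batchmean")

    return 0.5 * (loss_s2t + loss_t2s)


if __name__ == "__main__":
    # Toy usage: 4 sign clips, 12 candidate descriptions, 256-d embeddings.
    B, T, D = 4, 12, 256
    skel, text = torch.randn(B, D), torch.randn(T, D)
    mask = (torch.rand(B, T) > 0.7).float()
    mask[torch.arange(B), torch.arange(B)] = 1.0              # at least one positive each
    print(multi_positive_kl_loss(skel, text, mask))
```

In the paper's setting the same objective would presumably be evaluated once with global and synonym descriptions and once per body-part description stream, with the resulting global and part-level losses combined as stated in the abstract.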
Related papers
- Representing Signs as Signs: One-Shot ISLR to Facilitate Functional Sign Language Technologies [6.403291706982091]
Isolated Sign Language Recognition is crucial for scalable language technology. We propose a one-shot learning approach that generalises across languages and evolving vocabularies. We achieve state-of-the-art results, including 50.8% one-shot MRR on a large dictionary containing 10,235 unique signs from a different language.
arXiv Detail & Related papers (2025-02-27T15:07:51Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation [6.688680877428467]
We propose a novel gloss-free Multimodal Sign Language Translation framework.
We generate detailed textual descriptions of sign language components using multimodal large language models.
Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily.
arXiv Detail & Related papers (2024-11-25T09:01:41Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions. We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM [1.3089936156875277]
We introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector.
We propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task.
We also present a connector with an MoE architecture that manages multiple languages efficiently.
arXiv Detail & Related papers (2024-09-24T09:20:22Z) - SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems.
We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences.
Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
arXiv Detail & Related papers (2022-01-06T10:04:56Z) - Global-local Enhancement Network for NMFs-aware Sign Language Recognition [135.30357113518127]
We propose a simple yet effective architecture called the Global-local Enhancement Network (GLE-Net).
Of the two streams, one captures the global contextual relationship, while the other captures discriminative fine-grained cues.
We introduce the first non-manual-features-aware isolated Chinese sign language dataset, with a vocabulary of 1,067 sign words from daily life.
arXiv Detail & Related papers (2020-08-24T13:28:55Z)