Hierarchical Sub-action Tree for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2506.20947v1
- Date: Thu, 26 Jun 2025 02:27:50 GMT
- Title: Hierarchical Sub-action Tree for Continuous Sign Language Recognition
- Authors: Dejie Yang, Zhu Xu, Xinjie Gao, Yang Liu
- Abstract summary: Continuous sign language recognition aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR, leaving models with insufficient training data. We propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning.
- Score: 4.929852718777036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR, leaving models with insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.
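The abstract gives no implementation details, but its core idea (organizing LLM-derived gloss descriptions into a tree, matching visual features against it coarse-to-fine, and adding a contrastive alignment term) can be illustrated roughly as below. This is a minimal PyTorch sketch under assumed data shapes; the names SubActionNode, hierarchical_align, and contrastive_alignment are hypothetical and not taken from the paper.

```python
# Illustrative sketch only; names and details are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

class SubActionNode:
    """A node in a hypothetical sub-action tree: a text embedding plus finer children."""
    def __init__(self, text_emb, children=None):
        self.text_emb = text_emb          # (d,) embedding of an LLM-generated gloss description
        self.children = children or []    # finer-grained sub-action nodes

def hierarchical_align(visual_feat, root):
    """Descend the tree greedily, comparing the visual feature only with the
    children of the best-matching node at each level (coarse-to-fine search)."""
    node, score = root, visual_feat @ root.text_emb
    while node.children:
        sims = torch.stack([visual_feat @ c.text_emb for c in node.children])
        best = int(sims.argmax())
        node, score = node.children[best], sims[best]
    return node, score

def contrastive_alignment(visual_feats, text_feats, temperature=0.07):
    """A standard symmetric InfoNCE loss pulling matched visual/text pairs together,
    one plausible form of a 'contrastive alignment enhancement'."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Under this reading, descending the tree compares a visual feature against only the children of the current best node at each level, roughly branching-factor-times-depth similarity computations instead of one per leaf, which is one plausible interpretation of the complexity reduction the abstract attributes to the tree structure.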
Related papers
- Language-Image Alignment with Fixed Text Encoders [28.898689028197005]
Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly.
In this work, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning.
arXiv Detail & Related papers (2025-06-04T17:51:56Z) - Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text.
A common approach relies on gloss annotations as an intermediate representation.
We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z) - Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search [35.20525123189316]
Session search involves a series of interactive queries and actions to fulfill a user's complex information need.
Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions.
We propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches.
arXiv Detail & Related papers (2025-05-20T10:05:06Z) - SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation [23.60337935010744]
We present SE-GCL, an event-based, simple, and effective graph contrastive learning method for text representation.
Specifically, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections.
In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques.
arXiv Detail & Related papers (2024-12-16T10:53:24Z) - A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space shared between signed language and spoken language text.
The new dataset annotations provide continuous sign-level labels for six hours of test videos and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Gloss-free Sign Language Translation: Improving from Visual-Language
Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT framework based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
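To make the dual-encoder setup from the last entry concrete, here is a toy, self-contained PyTorch sketch of one training step with a symmetric contrastive loss; the linear encoders, feature dimensions, batch size, and learnable logit scale are illustrative placeholders rather than details from that paper.

```python
# Toy, self-contained demo of a dual-encoder contrastive step (placeholder encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 128)          # stand-in for a vision backbone
text_encoder = nn.Linear(300, 128)           # stand-in for a text encoder
logit_scale = nn.Parameter(torch.tensor(0.0))  # learnable inverse temperature (log-scale)

opt = torch.optim.Adam(list(image_encoder.parameters())
                       + list(text_encoder.parameters()) + [logit_scale], lr=1e-3)

images = torch.randn(32, 512)   # pretend image features for one noisy batch
texts = torch.randn(32, 300)    # pretend alt-text features for the same batch

img = F.normalize(image_encoder(images), dim=-1)
txt = F.normalize(text_encoder(texts), dim=-1)
logits = img @ txt.T * logit_scale.exp()      # (32, 32) pairwise similarities
labels = torch.arange(32)                     # the i-th text matches the i-th image
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

opt.zero_grad()
loss.backward()
opt.step()
```

With in-batch negatives, every mismatched image/alt-text pair in the batch acts as a negative, which is what allows such a simple objective to exploit very large, noisy corpora.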
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.