Graph Network for Sign Language Tasks
- URL: http://arxiv.org/abs/2504.12020v2
- Date: Sat, 17 May 2025 02:35:49 GMT
- Title: Graph Network for Sign Language Tasks
- Authors: Shiwei Gan, Yafeng Yin, Zhiwei Jiang, Hongkai Wen, Lei Xie, Sanglu Lu,
- Abstract summary: We introduce MixSignGraph, which represents sign sequences as a group of mixed graphs.<n>The LSG module learns the correlation of intra-frame cross-region features within one frame.<n>The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames.<n>The HSG module aggregates the same-region features from different-granularity feature maps of a frame.
- Score: 22.71156540352475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (\eg object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (\eg left hand region and right hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction, \ie Local Sign Graph (LSG) module, Temporal Sign Graph (TSG) module and Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, \ie focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, \ie focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, \ie focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on current five public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
Related papers
- LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models [54.82915844507371]
Text-Attributed Graphs (TAGs) are ubiquitous in real-world scenarios.<n>Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures.<n>We propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning.
arXiv Detail & Related papers (2025-03-05T09:45:22Z) - A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - UniGLM: Training One Unified Language Model for Text-Attributed Graph Embedding [31.464021556351685]
Unified Graph Language Model (UniGLM) is a graph embedding model that generalizes well to both in-domain and cross-domain TAGs.
UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training.
arXiv Detail & Related papers (2024-06-17T19:45:21Z) - RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (short as RGPT) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied and significantly enhancing performance across a range of region-level tasks.
arXiv Detail & Related papers (2024-03-04T18:58:08Z) - UniGraph: Learning a Unified Cross-Domain Foundation Model for Text-Attributed Graphs [30.635472655668078]
Text-Attributed Graphs (TAGs) can generalize to unseen graphs and tasks across diverse domains.<n>We propose a novel cascaded architecture of Language Models (LMs) and Graph Neural Networks (GNNs) as backbone networks.<n>We demonstrate the model's effectiveness in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer.
arXiv Detail & Related papers (2024-02-21T09:06:31Z) - Linguistically Motivated Sign Language Segmentation [51.06873383204105]
We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
arXiv Detail & Related papers (2023-10-21T10:09:34Z) - Weakly Supervised Semantic Segmentation by Knowledge Graph Inference [11.056545020611397]
This paper introduces a graph reasoning-based approach to enhance Weakly Supervised Semantic (WSSS)
The aim is to improve WSSS holistically by simultaneously enhancing both the multi-label classification and segmentation network stages.
We have achieved state-of-the-art performance in WSSS on the PASCAL VOC 2012 and MS-COCO datasets.
arXiv Detail & Related papers (2023-09-25T11:50:19Z) - Graph Adaptive Semantic Transfer for Cross-domain Sentiment
Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z) - A Simple Multi-Modality Transfer Learning Baseline for Sign Language
Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z) - Sign Language Translation with Hierarchical Spatio-TemporalGraph Neural
Network [6.623802929157273]
Sign language translation (SLT) generates text in a spoken language from visual content in a sign language.
In this paper, these unique characteristics of sign languages are formulated as hierarchical-temporal graph representations.
A novel deep learning architecture, namely hierarchical hierarchical-temporal graph neural network (HSTG-NN), is proposed.
arXiv Detail & Related papers (2021-11-14T07:02:28Z) - TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for
Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z) - Multi-Level Graph Convolutional Network with Automatic Graph Learning
for Hyperspectral Image Classification [63.56018768401328]
We propose a Multi-level Graph Convolutional Network (GCN) with Automatic Graph Learning method (MGCN-AGL) for HSI classification.
By employing attention mechanism to characterize the importance among spatially neighboring regions, the most relevant information can be adaptively incorporated to make decisions.
Our MGCN-AGL encodes the long range dependencies among image regions based on the expressive representations that have been produced at local level.
arXiv Detail & Related papers (2020-09-19T09:26:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.