Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition
- URL: http://arxiv.org/abs/2502.02196v1
- Date: Tue, 04 Feb 2025 10:21:28 GMT
- Title: Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition
- Authors: Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei
- Abstract summary: We present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective. Our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating its effectiveness in handling the challenges of cross-view recognition.
- Score: 14.547488459868442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must be capable of understanding gestures from multiple angles, making cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Finally, our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating its effectiveness in handling the challenges of cross-view recognition. The code is available at: https://github.com/Jiafei127/CV_ISLR_WWW2025.
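The abstract does not spell out the fusion rule behind its ensemble, but a common strategy for this kind of multi-view setup is to average the softmax probabilities of several independently trained classifiers. The sketch below is a hypothetical illustration only; the toy logits and model count are invented, not taken from the paper:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw class logits to probabilities (numerically stable)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits: list) -> np.ndarray:
    """Average the softmax probabilities of several models and
    return the arg-max class index for each video."""
    probs = np.mean([softmax(l) for l in member_logits], axis=0)
    return probs.argmax(axis=-1)

# Two toy "models" scoring one video over three sign classes.
m1 = np.array([[2.0, 0.5, 0.1]])   # model 1 strongly favours class 0
m2 = np.array([[0.2, 0.3, 1.8]])   # model 2 moderately favours class 2
print(ensemble_predict([m1, m2]))
```

Averaging probabilities rather than picking one model's arg-max lets a confident member outvote an uncertain one, which is a plausible source of the robustness the authors attribute to ensembling.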
Related papers
- GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework designed for robust visual speech recognition (VSR). GLip learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context.
arXiv Detail & Related papers (2025-09-19T14:36:01Z) - Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition [7.212104558068557]
This paper presents a dual-view dataset for isolated sign language recognition named NationalCSL-DP. The dataset consists of 134,140 sign videos recorded by ten signers from two vertical views. A CNN-Transformer hybrid network is also proposed as a strong baseline, together with an extremely simple but effective fusion strategy for prediction.
arXiv Detail & Related papers (2025-06-08T02:04:29Z) - Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation [2.6311088262657907]
This work proposes an Isolated Sign Language Recognition (ISLR) approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images.
We show that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS).
In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.
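The summary above describes encoding landmark trajectories as 2-D images, but not the exact layout. One common scheme, sketched here purely for illustration (the array shapes, channel assignment, and min-max scaling are assumptions, not the paper's specification), maps frames to rows, landmarks to columns, and (x, y) coordinates to color channels:

```python
import numpy as np

def landmarks_to_image(landmarks: np.ndarray) -> np.ndarray:
    """Encode a (T, K, 2) sequence of K 2-D landmarks over T frames as
    a uint8 image: rows = frames, columns = landmarks, channels =
    (x, y, 0), with each coordinate min-max scaled to [0, 255]."""
    lo = landmarks.min(axis=(0, 1), keepdims=True)
    hi = landmarks.max(axis=(0, 1), keepdims=True)
    scaled = (landmarks - lo) / np.maximum(hi - lo, 1e-8) * 255.0
    T, K, _ = landmarks.shape
    img = np.zeros((T, K, 3), dtype=np.uint8)
    img[..., :2] = scaled.astype(np.uint8)  # x -> red, y -> green
    return img
```

Encodings like this let an ordinary 2-D image backbone consume pose sequences, which is consistent with the summary's claim of a simpler architecture and RGB-only input.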
arXiv Detail & Related papers (2024-04-29T23:21:17Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
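As a rough illustration of the token-masking step that such autoencoders build on (this is not BTMAE's actual implementation; the mask ratio, shapes, and zero-fill convention are assumptions):

```python
import numpy as np

def mask_tokens(tokens: np.ndarray, mask_ratio: float = 0.5, seed: int = 0):
    """Zero out a random subset of token rows, as in masked-autoencoder
    pretraining; returns the masked tokens and the boolean mask so a
    decoder can be trained to reconstruct the hidden features."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[: int(n * mask_ratio)]] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask
```

In a bidirectional setup like the one described, masking would be applied to both the image-token and language-token sequences, with each side reconstructed from the other's context.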
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot
Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z) - Hierarchical I3D for Sign Spotting [39.69485385546803]
We focus on the challenging task of Sign Spotting rather than Isolated Sign Language Recognition.
We propose a hierarchical sign spotting approach which learns coarse-to-fine temporal sign features.
We achieve a state-of-the-art 0.607 F1 score, which was the top-1 winning solution of the ChaLearn 2022 Sign Spotting Challenge.
arXiv Detail & Related papers (2022-10-03T14:07:23Z) - Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL on its own is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z) - Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
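A minimal sketch of the semantic-space idea described above: classify by cosine similarity between a (projected) visual feature and word embeddings of the class labels, so unseen classes can be recognized through their embeddings alone. The vectors below are toy values, not the paper's features:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(visual_feat: np.ndarray, label_embeds: np.ndarray) -> int:
    """Return the index of the label embedding closest (in cosine
    similarity) to the visual feature."""
    sims = l2_normalize(visual_feat[None, :]) @ l2_normalize(label_embeds).T
    return int(sims.argmax())

# Toy example: three orthogonal label embeddings, one visual feature.
labels = np.eye(3)
print(zero_shot_classify(np.array([0.1, 0.9, 0.2]), labels))
```

Because classification reduces to nearest-neighbor search in the shared space, adding a new sign class only requires its label embedding, with no retraining of the visual backbone.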
arXiv Detail & Related papers (2021-09-02T09:10:39Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach that captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled sign language news to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.