Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
- URL: http://arxiv.org/abs/2503.06484v1
- Date: Sun, 09 Mar 2025 06:55:46 GMT
- Title: Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
- Authors: Xiao Wang, Yuehang Li, Fuling Wang, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang, Bin Luo
- Abstract summary: Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. We propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters.
- Score: 58.60058450730943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate sign language understanding serves as a crucial communication channel for individuals with disabilities. Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. Inspired by the recent successful application of event cameras in other fields, we propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, termed VECSL, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters. These samples were gathered across a diverse range of indoor and outdoor environments, capturing multiple viewing angles, varying light intensities, and different camera motions. Due to the absence of benchmark algorithms for comparison in this new task, we retrained and evaluated multiple state-of-the-art SLT algorithms, and believe that this benchmark can effectively support subsequent related research. Additionally, we propose a novel RGB-Event sign language translation framework (i.e., M$^2$-SLT) that incorporates fine-grained micro-sign and coarse-grained macro-sign retrieval, achieving state-of-the-art results on the proposed dataset. Both the source code and dataset will be released on https://github.com/Event-AHU/OpenESL.
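The M$^2$-SLT framework itself is described only at a high level in the abstract; as a rough illustration of the RGB-Event pairing it builds on (per-frame features from aligned RGB and event inputs, fused and contextualized over time), here is a minimal PyTorch sketch. All module names, dimensions, and the choice of a Transformer temporal encoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of RGB-Event feature fusion for sign language
# translation. This is NOT the authors' M^2-SLT; all names and dimensions
# are assumptions chosen for clarity.
import torch
import torch.nn as nn

class RGBEventFusionEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # One lightweight CNN per modality; the event stream is assumed to be
        # rasterized into frames so both inputs are (B, T, C, H, W).
        self.rgb_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.event_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        # Temporal encoder over the fused per-frame features.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, rgb, events):
        # rgb, events: (B, T, C, H, W) temporally aligned clips.
        B, T = rgb.shape[:2]
        f_rgb = self.rgb_cnn(rgb.flatten(0, 1)).view(B, T, -1)
        f_evt = self.event_cnn(events.flatten(0, 1)).view(B, T, -1)
        fused = self.fuse(torch.cat([f_rgb, f_evt], dim=-1))  # (B, T, D)
        return self.temporal(fused)  # contextualized clip features

enc = RGBEventFusionEncoder()
clip = torch.randn(2, 8, 3, 128, 128)   # dummy RGB clip
evt = torch.randn(2, 8, 3, 128, 128)    # dummy rasterized event clip
print(enc(clip, evt).shape)             # torch.Size([2, 8, 256])
```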
Related papers
- E-VLC: A Real-World Dataset for Event-based Visible Light Communication And Localization [4.269675382023856]
Event cameras can be used to decode the LED signals and also to localize the camera relative to the LED marker positions.
However, there is no public dataset for benchmarking such decoding and localization across varied real-world settings.
We present the first public dataset consisting of recordings from an event camera and a frame camera, together with ground-truth poses, all precisely synchronized with hardware triggers.
arXiv Detail & Related papers (2025-04-25T17:43:20Z)
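As a toy illustration of the LED-decoding idea in the E-VLC entry above (not the dataset's actual protocol), the dominant blink frequency of a modulated LED can be recovered by binning event counts over time and inspecting the spectrum; the binning rate, signal model, and noise model below are all assumptions:

```python
# Toy recovery of an LED's modulation frequency from event counts via FFT.
import numpy as np

fs = 10_000                                 # binning rate in Hz (assumed)
t = np.arange(0, 0.1, 1 / fs)               # 100 ms observation window
led_hz = 1_000                              # LED modulation frequency to recover
# Synthetic per-bin event counts: bursts while the LED is on (toy model),
# plus Poisson-distributed sensor noise.
counts = (np.sin(2 * np.pi * led_hz * t) > 0).astype(float)
counts += np.random.default_rng(0).poisson(0.1, t.size)

spectrum = np.abs(np.fft.rfft(counts - counts.mean()))  # remove DC component
freqs = np.fft.rfftfreq(t.size, 1 / fs)
print(freqs[spectrum.argmax()])             # ~1000.0 Hz
```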
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
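The token-merging idea behind the entry above (folding spatially neighbouring visual tokens into the channel dimension before the Transformer layers, then splitting them back after) can be sketched as a pair of lossless reshapes; the actual Token-Shuffle operation may differ in its details:

```python
# Illustrative token shuffle/unshuffle: merge s x s neighbouring tokens along
# channels to cut the token count by s^2, losslessly invertible.
import torch

def token_shuffle(x, h, w, s=2):
    """x: (B, h*w, C) token grid -> (B, (h//s)*(w//s), C*s*s)."""
    B, N, C = x.shape
    assert N == h * w and h % s == 0 and w % s == 0
    x = x.view(B, h // s, s, w // s, s, C)   # split the grid into s x s cells
    x = x.permute(0, 1, 3, 2, 4, 5)          # (B, h//s, w//s, s, s, C)
    return x.reshape(B, (h // s) * (w // s), C * s * s)

def token_unshuffle(x, h, w, s=2):
    """Inverse of token_shuffle: (B, (h//s)*(w//s), C*s*s) -> (B, h*w, C)."""
    B, N, Cs = x.shape
    C = Cs // (s * s)
    x = x.view(B, h // s, w // s, s, s, C)
    x = x.permute(0, 1, 3, 2, 4, 5)          # (B, h//s, s, w//s, s, C)
    return x.reshape(B, h * w, C)

tokens = torch.randn(1, 32 * 32, 64)         # 1024 visual tokens
short = token_shuffle(tokens, 32, 32, s=2)   # 256 tokens, 4x fewer
restored = token_unshuffle(short, 32, 32, s=2)
assert torch.allclose(tokens, restored)      # lossless round trip
```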
- EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition [39.12227212510573]
Existing Scene Text Recognition algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds.
We propose to recognize the scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR.
We also propose a new event-based scene text recognition framework, termed SimC-ESTR.
arXiv Detail & Related papers (2025-02-13T07:16:16Z)
- EventGPT: Event Stream Understanding with Multimodal Large Language Models [59.65010502000344]
Event cameras record visual information as asynchronous pixel-change streams, excelling at scene perception under poor lighting or high-dynamic-range conditions.
Existing multimodal large language models (MLLMs) concentrate on natural RGB images and fail in scenarios where event data is a better fit.
We introduce EventGPT, the first MLLM for event stream understanding.
arXiv Detail & Related papers (2024-12-01T14:38:40Z)
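Before a model such as EventGPT can consume an event stream, the asynchronous (x, y, t, polarity) tuples are typically rasterized into a dense tensor. The voxel-grid accumulation below is one common generic scheme, not necessarily the paper's own preprocessing:

```python
# Accumulate asynchronous (x, y, t, p) events into a fixed-size voxel grid
# so standard vision backbones can consume them.
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """events: (N, 4) array of [x, y, t, polarity in {-1, +1}]."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps into [0, num_bins) and bin them.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    b = t_norm.astype(int)
    np.add.at(grid, (b, y, x), p)   # signed accumulation per bin/pixel
    return grid

# Toy stream: 1000 random events on a 64x48 sensor over 10 ms.
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 64, 1000), rng.integers(0, 48, 1000),
               np.sort(rng.uniform(0, 0.01, 1000)),
               rng.choice([-1.0, 1.0], 1000)], axis=1)
voxels = events_to_voxel_grid(ev, num_bins=5, height=48, width=64)
print(voxels.shape)  # (5, 48, 64)
```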
- Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm [46.002495818680934]
This paper proposes the use of high-definition event streams for Sign Language Translation.
Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well.
We propose a novel baseline method that fully leverages the Mamba model's ability to integrate temporal information of CNN features.
arXiv Detail & Related papers (2024-08-20T02:01:30Z)
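The entry above pairs per-frame CNN features with a Mamba sequence model for temporal integration. A minimal sketch of that second stage, assuming the official mamba_ssm package (https://github.com/state-spaces/mamba) and a CUDA device; the 256-dimensional features standing in for CNN outputs are an assumption:

```python
# Temporal integration of per-frame features with a Mamba block, following
# the usage pattern documented in the mamba_ssm README.
import torch
from mamba_ssm import Mamba

frame_feats = torch.randn(2, 32, 256).to("cuda")  # (batch, frames, feat dim)
temporal = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
out = temporal(frame_feats)                       # linear-time sequence scan
assert out.shape == frame_feats.shape             # (2, 32, 256)
```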
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
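The retrieval step that a framework like SEDS ultimately feeds is straightforward to sketch: score L2-normalized video and text embeddings by cosine similarity. The encoders that would produce these embeddings are not reproduced here; all shapes are placeholders:

```python
# Cross-modal retrieval by cosine similarity over precomputed embeddings.
import torch
import torch.nn.functional as F

video_emb = F.normalize(torch.randn(100, 512), dim=-1)  # 100 sign videos
text_emb = F.normalize(torch.randn(5, 512), dim=-1)     # 5 text queries

scores = text_emb @ video_emb.T         # (5, 100) cosine similarities
top5 = scores.topk(5, dim=-1).indices   # best-matching videos per query
print(top5.shape)                       # torch.Size([5, 5])
```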
- EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event cameras can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z)
- SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition [42.118434116034194]
We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarcity of RGB-Event based classification datasets, we also propose the large-scale PokerEvent dataset.
arXiv Detail & Related papers (2023-08-08T16:15:35Z)
- Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping [4.191965713559235]
We propose a deep learning network that fuses two types of visual signals: event-based data and RGB frame data.
Bimodal SegNet has two distinct encoders, one for each input signal, and a spatial pyramid pooling module with atrous convolutions.
The evaluation results show a 6-10% improvement over state-of-the-art methods in terms of mean intersection over union (mIoU) and pixel accuracy.
arXiv Detail & Related papers (2023-03-20T16:09:25Z)
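The dual-encoder-plus-pyramid design in the Bimodal SegNet entry above can be sketched generically: concatenate per-modality feature maps, then apply parallel atrous (dilated) convolutions to gather multi-scale context. Channel sizes and dilation rates below are assumptions, not the paper's configuration:

```python
# Generic ASPP-style block over fused RGB and event feature maps.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # Parallel dilated convolutions sample context at multiple scales
        # without reducing spatial resolution.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Dummy feature maps from the two encoders (RGB + event), fused then pooled.
rgb_feat = torch.randn(1, 64, 32, 32)
evt_feat = torch.randn(1, 64, 32, 32)
fused = torch.cat([rgb_feat, evt_feat], dim=1)  # (1, 128, 32, 32)
print(ASPP(128, 64)(fused).shape)               # torch.Size([1, 64, 32, 32])
```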
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.