Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm
- URL: http://arxiv.org/abs/2408.10488v1
- Date: Tue, 20 Aug 2024 02:01:30 GMT
- Title: Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm
- Authors: Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang,
- Abstract summary: This paper proposes the use of high-definition Event streams for Sign Language Translation.
Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well.
We propose a novel baseline method that fully leverages the Mamba model's ability to integrate temporal information of CNN features.
- Score: 46.002495818680934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well. Additionally, due to their sparsity in space, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model's ability to integrate temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released on https://github.com/Event-AHU/OpenESL
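The abstract describes the proposed baseline only at a high level: a CNN extracts per-frame features from event data, a Mamba model integrates those features along the temporal axis, and the result is decoded into Chinese text. The snippet below is a minimal sketch of that pipeline under several assumptions of our own, not the code to be released at the repository above: it assumes events have already been stacked into frame-like tensors, uses a torchvision ResNet-18 as the per-frame CNN, relies on the `mamba_ssm` package for the temporal block, and substitutes a mean-pooled linear head over the 2,544-word vocabulary for a full translation decoder.

```python
# Minimal sketch of a CNN + Mamba SLT baseline (not the authors' released code).
# Assumes: events already stacked into T frame-like tensors, torchvision, and
# the `mamba_ssm` package (its fused kernels require a CUDA device).
import torch
import torch.nn as nn
from torchvision.models import resnet18
from mamba_ssm import Mamba  # selective state-space block


class EventSLTBaseline(nn.Module):
    def __init__(self, vocab_size: int = 2544, d_model: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # ResNet-18 now emits 512-d features
        self.cnn = backbone                  # per-frame spatial feature extractor
        # Mamba mixes information along the temporal axis, input shape (B, T, C).
        self.temporal = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.norm = nn.LayerNorm(d_model)
        # Stand-in head: sequence-level logits over the Chinese text vocabulary.
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) event frames stacked from the raw stream
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))   # (B*T, 512)
        feats = feats.view(b, t, -1)              # (B, T, 512)
        feats = self.norm(self.temporal(feats))   # temporal integration via Mamba
        return self.head(feats.mean(dim=1))       # (B, vocab_size)


if __name__ == "__main__":
    model = EventSLTBaseline().cuda()
    clips = torch.randn(2, 8, 3, 224, 224, device="cuda")  # 2 clips, 8 frames each
    print(model(clips).shape)                               # torch.Size([2, 2544])
```

In practice the text side would be an autoregressive sequence decoder rather than a single pooled prediction; the sketch only illustrates where the Mamba block sits relative to the CNN features. How raw events might be stacked into the frame tensors consumed here is sketched after the related-papers list below.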
Related papers
- EvLight++: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset, Novel Method, and More [7.974102031202597]
EvLight++ is a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios.
EvLight++ significantly outperforms both single image- and video-based methods by 1.37 dB and 3.71 dB, respectively.
arXiv Detail & Related papers (2024-08-29T04:30:31Z)
- An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z)
- EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event cameras can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z)
- Segment and Caption Anything [126.20201216616137]
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
arXiv Detail & Related papers (2023-12-01T19:00:17Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors [40.949347728083474]
Mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which suffer from sensitivity to illumination, fast motion, privacy concerns, and large energy consumption.
Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, and low power consumption.
As the event camera is a newly emerging sensor, there is not yet a realistic large-scale dataset for HAR.
We propose a large-scale benchmark dataset, termed HARDVS, which contains 300 categories and more than 100K event sequences.
arXiv Detail & Related papers (2022-11-17T16:48:50Z)
- SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read the entire video before starting translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
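Several entries above (EvSign, HARDVS) and the main abstract share one preprocessing assumption: the asynchronous event stream, a list of (x, y, timestamp, polarity) tuples, must first be stacked into frame-like tensors before a CNN can consume it. The sketch below shows one generic scheme (uniform temporal bins with per-pixel, per-polarity event counting); the bin count, resolution, and function name are illustrative choices, not the exact representation used by any paper listed here.

```python
# Minimal sketch: stack raw events (x, y, t, polarity) into frame-like tensors.
# Generic uniform-bin accumulation; not any listed paper's exact representation.
import numpy as np


def stack_events(events: np.ndarray, height: int, width: int,
                 num_bins: int = 8) -> np.ndarray:
    """events: (N, 4) array with columns x, y, t, polarity (+1 / -1).

    Returns (num_bins, 2, height, width): one positive- and one negative-
    polarity count channel per temporal bin.
    """
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    if len(events) == 0:
        return frames
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = (events[:, 3] > 0).astype(np.int64)          # 1 = positive, 0 = negative
    # Assign every event to a uniform temporal bin over [t_min, t_max].
    span = max(t.max() - t.min(), 1e-9)
    bins = np.minimum(((t - t.min()) / span * num_bins).astype(np.int64),
                      num_bins - 1)
    # Accumulate per-pixel event counts for each (bin, polarity) slice.
    np.add.at(frames, (bins, p, y, x), 1.0)
    return frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10_000
    demo = np.stack([rng.integers(0, 1280, n),        # x (HD sensor width)
                     rng.integers(0, 720, n),         # y
                     np.sort(rng.random(n)),          # timestamps
                     rng.choice([-1, 1], n)], axis=1)
    print(stack_events(demo, height=720, width=1280).shape)  # (8, 2, 720, 1280)
```

Frames produced this way (optionally projected to three channels by a small convolutional stem) are the kind of input the CNN backbone in the earlier sketch would take.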
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.