Writing in The Air: Unconstrained Text Recognition from Finger Movement
Using Spatio-Temporal Convolution
- URL: http://arxiv.org/abs/2104.09021v1
- Date: Mon, 19 Apr 2021 02:37:46 GMT
- Title: Writing in The Air: Unconstrained Text Recognition from Finger Movement
Using Spatio-Temporal Convolution
- Authors: Ue-Hwan Kim, Yewon Hwang, Sun-Kyung Lee, Jong-Hwan Kim
- Abstract summary: In this paper, we introduce a new benchmark dataset for the challenging writing in the air (WiTA) task.
WiTA implements an intuitive and natural writing method with finger movement for human-computer interaction.
Our dataset consists of five sub-datasets in two languages (Korean and English) and amounts to 209,926 instances from 122 participants.
- Score: 3.3502165500990824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a new benchmark dataset for the challenging
writing in the air (WiTA) task -- an elaborate task bridging vision and NLP.
WiTA implements an intuitive and natural writing method with finger movement
for human-computer interaction (HCI). Our WiTA dataset will facilitate the
development of data-driven WiTA systems, which have so far displayed
unsatisfactory performance due to the lack of datasets and the traditional
statistical models they have adopted. Our dataset consists of five sub-datasets
in two languages (Korean and English) and amounts to 209,926 video instances
from 122 participants. We capture finger movement for WiTA with RGB cameras to
ensure wide accessibility and cost-efficiency. Next, we propose spatio-temporal
residual network architectures inspired by 3D ResNet. These models perform
unconstrained text recognition from finger movement, guarantee real-time
operation by processing 435 and 697 decoding frames per second for Korean and
English, respectively, and will serve as an evaluation standard. Our dataset
and the source codes are available at https://github.com/Uehwan/WiTA.
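As an illustration of the kind of spatio-temporal residual architecture the abstract describes, the sketch below implements a minimal 3D-convolutional residual block in PyTorch. It is a toy example under stated assumptions, not the authors' exact WiTA model: the class name, channel sizes, and clip shape are placeholders, and the real architectures and decoding code are in the linked repository.

```python
# Illustrative sketch only: a minimal spatio-temporal (3D) residual block in the
# spirit of 3D ResNet. Names and hyper-parameters here are assumptions for
# illustration; see https://github.com/Uehwan/WiTA for the actual models.
import torch
import torch.nn as nn


class SpatioTemporalResBlock(nn.Module):
    """Residual block with 3x3x3 convolutions over (time, height, width)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when the spatial/temporal shape or width changes.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


if __name__ == "__main__":
    # A short clip of RGB finger-movement video: (batch, channels, frames, H, W).
    clip = torch.randn(2, 3, 16, 112, 112)
    block = SpatioTemporalResBlock(in_channels=3, out_channels=32, stride=2)
    print(block(clip).shape)  # torch.Size([2, 32, 8, 56, 56])
```

Stacking such blocks and pooling over the spatial dimensions yields per-frame features that a sequence decoder can turn into unconstrained text; the exact decoding scheme used in the paper should be taken from the repository rather than this sketch.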
Related papers
- EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event cameras can naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z) - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for sequences that are most relevant to a natural motion description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z) - Enhancing Sign Language Detection through Mediapipe and Convolutional Neural Networks (CNN) [3.192629447369627]
This research combines MediaPipe and CNNs for efficient and accurate interpretation of the ASL dataset.
The accuracy achieved by the model on ASL datasets is 99.12%.
The system will have applications in the communication, education, and accessibility domains.
arXiv Detail & Related papers (2024-06-06T04:05:12Z) - Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Multimodal research related to touch has mainly focused on the visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z) - Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion
Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
arXiv Detail & Related papers (2023-05-25T08:32:41Z) - ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z) - Robotic Detection of a Human-Comprehensible Gestural Language for
Underwater Multi-Human-Robot Collaboration [16.823029377470363]
We present a motion-based robotic communication framework that enables non-verbal communication among autonomous underwater vehicles (AUVs) and human divers.
We design a gestural language for AUV-to-AUV communication which can be easily understood by divers observing the conversation.
To allow AUVs to visually understand a gesture from another AUV, we propose a deep network (RRCommNet) which exploits a self-attention mechanism to learn to recognize each message by extracting discriminative spatio-temporal features.
arXiv Detail & Related papers (2022-07-12T06:04:12Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - IPN Hand: A Video Dataset and Benchmark for Real-Time Continuous Hand
Gesture Recognition [11.917058689674327]
We introduce a new benchmark dataset named IPN Hand with sufficient size, variety, and real-world elements to train and evaluate deep neural networks.
This dataset contains more than 4,000 gesture samples and 800,000 RGB frames from 50 distinct subjects.
With our dataset, the performance of three 3D-CNN models is evaluated on the tasks of isolated and continuous real-time HGR.
arXiv Detail & Related papers (2020-04-20T08:52:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.