Finger in Camera Speaks Everything: Unconstrained Air-Writing for Real-World
- URL: http://arxiv.org/abs/2412.19537v1
- Date: Fri, 27 Dec 2024 09:04:04 GMT
- Title: Finger in Camera Speaks Everything: Unconstrained Air-Writing for Real-World
- Authors: Meiqi Wu, Kaiqi Huang, Yuanqiang Cai, Shiyu Hu, Yuzhong Zhao, Weiqiang Wang,
- Abstract summary: We present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024)
This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras.
We also introduce our baseline approach, the video-based character recognizer (VCRec).
- Score: 45.972735599458446
- Abstract: Air-writing is a challenging task that combines the fields of computer vision and natural language processing, offering an intuitive and natural approach for human-computer interaction. However, current air-writing solutions face two primary challenges: (1) their dependency on complex sensors (e.g., Radar, EEGs and others) for capturing precise handwritten trajectories, and (2) the absence of a video-based air-writing dataset that covers a comprehensive vocabulary range. These limitations impede their practicality in various real-world scenarios, including the use on devices like iPhones and laptops. To tackle these challenges, we present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024), serving as a pioneering benchmark for video-based air-writing. This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras, eliminating the need for complex sensors. AWCV-100K-UCAS2024 includes 8.8 million video frames, encompassing the complete set of 3,755 characters from the GB2312-80 level-1 set (GB1). Furthermore, we introduce our baseline approach, the video-based character recognizer (VCRec). VCRec adeptly extracts fingertip features from sparse visual cues and employs a spatio-temporal sequence module for analysis. Experimental results showcase the superior performance of VCRec compared to existing models in recognizing air-written characters, both quantitatively and qualitatively. This breakthrough paves the way for enhanced human-computer interaction in real-world contexts. Moreover, our approach leverages affordable RGB cameras, enabling its applicability in a diverse range of scenarios. The code and data examples will be made public at https://github.com/wmeiqi/AWCV.
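The authors' VCRec implementation is the one to be released at the repository above; the snippet below is only a minimal sketch of the general RGB-camera pipeline the abstract describes, assuming MediaPipe Hands for fingertip detection and a small GRU classifier as a stand-in for the spatio-temporal sequence module (both are illustrative substitutions, not the paper's components).

```python
# Minimal sketch (not the authors' VCRec): extract an index-fingertip trajectory
# from an RGB video with MediaPipe Hands, then classify the trajectory with a
# small GRU. The class count 3755 matches the GB2312-80 level-1 (GB1) set.
import cv2
import mediapipe as mp
import torch
import torch.nn as nn

def fingertip_trajectory(video_path: str) -> torch.Tensor:
    """Return a (T, 2) tensor of normalized index-fingertip coordinates."""
    hands = mp.solutions.hands.Hands(max_num_hands=1)
    cap = cv2.VideoCapture(video_path)
    points = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            tip = result.multi_hand_landmarks[0].landmark[
                mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP]
            points.append([tip.x, tip.y])  # MediaPipe coordinates are already in [0, 1]
    cap.release()
    hands.close()
    return torch.tensor(points, dtype=torch.float32)

class TrajectoryRecognizer(nn.Module):
    """Stand-in for a spatio-temporal sequence module: a GRU over (x, y) points."""
    def __init__(self, num_classes: int = 3755, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(traj.unsqueeze(0))  # (1, T, 2) -> final hidden state
        return self.head(h[-1])             # (1, num_classes) logits

if __name__ == "__main__":
    traj = fingertip_trajectory("air_writing_sample.mp4")  # hypothetical input file
    logits = TrajectoryRecognizer()(traj)
    print("predicted GB1 class index:", logits.argmax(dim=-1).item())
```

In practice the trajectory would be resampled to a fixed length and the classifier trained on the 3,755 GB1 classes; the sketch only shows the data flow from raw RGB video to a character prediction.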
Related papers
- Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms [29.577583619354314]
We propose a large-scale, high-definition ($1280 \times 800$) human action recognition dataset based on the CeleX-V event camera.
To build a more comprehensive benchmark dataset, we report over 20 mainstream HAR models for future works to compare.
arXiv Detail & Related papers (2024-08-19T07:52:20Z)
- Open-World Human-Object Interaction Detection via Multi-modal Prompts [26.355054079885463]
MP-HOI is a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions.
MP-HOI could serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times.
arXiv Detail & Related papers (2024-06-11T13:01:45Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
To address this problem, we propose using learnable tokens as a communication medium among the four frozen models: GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features [0.0]
Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics.
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023.
arXiv Detail & Related papers (2023-12-06T08:58:11Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- To show or not to show: Redacting sensitive text from videos of electronic displays [4.621328863799446]
We define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques.
We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV); a minimal sketch of the OCR-and-filter idea appears after this list.
arXiv Detail & Related papers (2022-08-19T07:53:04Z)
- SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021 [80.05652375838073]
This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021.
Our submission, visible on the public leaderboard, achieved a top-1 action recognition accuracy of 44.82%, using only RGB.
arXiv Detail & Related papers (2021-10-06T16:29:47Z)
- Writing in The Air: Unconstrained Text Recognition from Finger Movement Using Spatio-Temporal Convolution [3.3502165500990824]
In this paper, we introduce a new benchmark dataset for the challenging writing in the air (WiTA) task.
WiTA implements an intuitive and natural writing method with finger movement for human-computer interaction.
Our dataset consists of five sub-datasets in two languages (Korean and English) and amounts to 209,926 instances from 122 participants.
arXiv Detail & Related papers (2021-04-19T02:37:46Z)
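As referenced in the redaction entry above ("To show or not to show"), the OCR-and-filter idea can be illustrated compactly. This is a minimal sketch under stated assumptions, not that paper's pipeline: it uses pytesseract (a wrapper for Tesseract, one of the OCR engines that paper compares) and substitutes a simple regular-expression check for the paper's NLP stage; the patterns, file names, and blur-based redaction are illustrative choices.

```python
# Minimal sketch of OCR-based redaction (not the paper's implementation):
# locate words with Tesseract, flag ones matching simple "sensitive" patterns,
# and blur their bounding boxes in the frame. A real system would replace the
# regex check with an NLP stage, as the paper describes.
import re
import cv2
import pytesseract

# Illustrative patterns: email addresses and SSN-like number groups.
SENSITIVE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}-\d{2}-\d{4}\b")

def redact_frame(frame):
    """Blur every OCR-detected word that matches a sensitive pattern."""
    data = pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word and SENSITIVE.search(word):
            x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w], (51, 51), 0)
    return frame

if __name__ == "__main__":
    cap = cv2.VideoCapture("screen_recording.mp4")  # hypothetical input video
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("redacted_frame.png", redact_frame(frame))
    cap.release()
```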
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.