Pervasive Hand Gesture Recognition for Smartphones using Non-audible
Sound and Deep Learning
- URL: http://arxiv.org/abs/2108.02148v1
- Date: Wed, 4 Aug 2021 16:23:26 GMT
- Title: Pervasive Hand Gesture Recognition for Smartphones using Non-audible
Sound and Deep Learning
- Authors: Ahmed Ibrahim, Ayman El-Refai, Sara Ahmed, Mariam Aboul-Ela, Hesham M.
Eraqi, Mohamed Moustafa
- Abstract summary: This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones.
The proposed system emits an ultrasonic sonar-based signal (inaudible sound) from the smartphone's stereo speakers, which is then received by the smartphone's microphone and processed via a Convolutional Neural Network (CNN) for Hand Gesture Recognition.
- Score: 1.529170372164118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the rapid advancement of ubiquitous technologies, new
pervasive methods have come into practice to provide innovative features and
stimulate research on new human-computer interactions. This
paper presents a hand gesture recognition method that utilizes the smartphone's
built-in speakers and microphones. The proposed system emits an ultrasonic
sonar-based signal (inaudible sound) from the smartphone's stereo speakers,
which is then received by the smartphone's microphone and processed via a
Convolutional Neural Network (CNN) for Hand Gesture Recognition. Data
augmentation techniques are proposed to improve the detection accuracy and
three dual-channel input fusion methods are compared. The first method merges
the dual-channel audio as a single input spectrogram image. The second method
adopts early fusion by concatenating the dual-channel spectrograms. The third
method adopts late fusion: two convolutional input branches process each of
the dual-channel spectrograms, and the outputs are merged by the last layers.
Our experimental results demonstrate promising detection performance for the
six gestures in our publicly available dataset, with a baseline accuracy of
93.58%.
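
The three fusion strategies are concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of them; the module names (MergedInputCNN, EarlyFusionCNN, LateFusionCNN), layer sizes, and input shapes are illustrative assumptions, not the authors' published architecture. Inputs are assumed to be log-spectrogram images of shape (batch, 1, frequency, time) per stereo channel.

```python
# Minimal sketch of the paper's three dual-channel fusion strategies.
# All names, layer sizes, and input shapes are illustrative assumptions.
import torch
import torch.nn as nn

N_GESTURES = 6  # six gestures in the paper's dataset

def conv_block(in_ch: int) -> nn.Sequential:
    # Small convolutional feature extractor (hypothetical sizes).
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features regardless of F, T
    )

class MergedInputCNN(nn.Module):
    """Method 1: both channels rendered as one spectrogram image."""
    def __init__(self):
        super().__init__()
        self.features = conv_block(in_ch=1)
        self.head = nn.Linear(32 * 4 * 4, N_GESTURES)

    def forward(self, merged):               # (B, 1, F, T)
        return self.head(self.features(merged).flatten(1))

class EarlyFusionCNN(nn.Module):
    """Method 2: early fusion, spectrograms stacked as input channels."""
    def __init__(self):
        super().__init__()
        self.features = conv_block(in_ch=2)
        self.head = nn.Linear(32 * 4 * 4, N_GESTURES)

    def forward(self, left, right):          # each (B, 1, F, T)
        x = torch.cat([left, right], dim=1)  # (B, 2, F, T)
        return self.head(self.features(x).flatten(1))

class LateFusionCNN(nn.Module):
    """Method 3: late fusion, one convolutional branch per channel,
    merged by the final (fully connected) layers."""
    def __init__(self):
        super().__init__()
        self.branch_l = conv_block(in_ch=1)
        self.branch_r = conv_block(in_ch=1)
        self.head = nn.Linear(2 * 32 * 4 * 4, N_GESTURES)

    def forward(self, left, right):
        fl = self.branch_l(left).flatten(1)
        fr = self.branch_r(right).flatten(1)
        return self.head(torch.cat([fl, fr], dim=1))
```

Under these assumptions, EarlyFusionCNN()(left, right) and LateFusionCNN()(left, right) take per-channel spectrograms, while MergedInputCNN() takes the single merged spectrogram image; each returns logits over the six gesture classes.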
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that improves the representation of each modality by fusing them at different levels of the audio/visual encoders; a minimal sketch of this fusion pattern appears after this list.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Learning to Rank Microphones for Distant Speech Recognition [16.47293353050145]
Empirical evidence shows that being able to select the best microphone leads to significant improvements in recognition.
Current channel selection techniques either rely on signal, decoder or posterior-based features.
We propose MicRank, a learning to rank framework where a neural network is trained to rank the available channels.
arXiv Detail & Related papers (2021-04-06T22:39:30Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection [2.495606047371841]
We propose DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection.
We execute DeepMSRF by feeding it features of the two modalities, namely speakers' audio and face images.
The goal of DeepMSRF is to identify the gender of the speaker first, and further to recognize his or her name for any given video stream.
arXiv Detail & Related papers (2020-07-14T04:28:12Z)
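
For the MLCA-AVSR entry above, the fusion pattern (cross-attention applied at multiple encoder levels) can be illustrated with a short, hypothetical PyTorch sketch. The dimensions, depth, and the choice to inject visual context into the audio stream only are assumptions for illustration, not that paper's exact design.

```python
# Hypothetical sketch of multi-layer cross-attention fusion in the spirit of
# the MLCA-AVSR entry: audio and visual features are fused at several encoder
# depths. Dimensions, depth, and audio-side-only injection are assumptions.
import torch
import torch.nn as nn

class MultiLayerCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.visual_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])
        # One cross-attention block per encoder level: audio queries attend
        # to visual keys/values.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, dim); the time axes may differ.
        for a_layer, v_layer, xattn in zip(
                self.audio_layers, self.visual_layers, self.cross_attn):
            audio = a_layer(audio)
            visual = v_layer(visual)
            fused, _ = xattn(query=audio, key=visual, value=visual)
            audio = audio + fused  # inject visual context at this level
        return audio  # fused representation for a downstream ASR decoder
```

Fusing at every level, rather than only once at the end, is what distinguishes this pattern from the plain late fusion sketched earlier for the main paper.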