Pervasive Hand Gesture Recognition for Smartphones using Non-audible
Sound and Deep Learning
- URL: http://arxiv.org/abs/2108.02148v1
- Date: Wed, 4 Aug 2021 16:23:26 GMT
- Title: Pervasive Hand Gesture Recognition for Smartphones using Non-audible
Sound and Deep Learning
- Authors: Ahmed Ibrahim, Ayman El-Refai, Sara Ahmed, Mariam Aboul-Ela, Hesham M.
Eraqi, Mohamed Moustafa
- Abstract summary: This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones.
The proposed system emits an ultrasonic sonar-based signal (inaudible sound) from the smartphone's stereo speakers, which is then received by the smartphone's microphone and processed via a Convolutional Neural Network (CNN) for Hand Gesture Recognition.
- Score: 1.529170372164118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the rapid advancement of ubiquitous technologies, new
pervasive methods have come into practice to provide innovative features and
stimulate research on new human-computer interactions. This
paper presents a hand gesture recognition method that utilizes the smartphone's
built-in speakers and microphones. The proposed system emits an ultrasonic
sonar-based signal (inaudible sound) from the smartphone's stereo speakers,
which is then received by the smartphone's microphone and processed via a
Convolutional Neural Network (CNN) for Hand Gesture Recognition. Data
augmentation techniques are proposed to improve the detection accuracy and
three dual-channel input fusion methods are compared. The first method merges
the dual-channel audio as a single input spectrogram image. The second method
adopts early fusion by concatenating the dual-channel spectrograms. The third
method adopts late fusion: two convolutional input branches process each of
the dual-channel spectrograms, and the outputs are merged by the last layers.
Our experimental results demonstrate promising detection performance for the
six gestures in our publicly available dataset, with a baseline accuracy of
93.58%.
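
The three fusion strategies are concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of them; the module names (MergedInputCNN, EarlyFusionCNN, LateFusionCNN), layer sizes, and input shapes are illustrative assumptions, not the authors' published architecture. Inputs are assumed to be log-spectrogram images of shape (batch, 1, frequency, time) per stereo channel.

```python
# Minimal sketch of the paper's three dual-channel fusion strategies.
# All names, layer sizes, and input shapes are illustrative assumptions.
import torch
import torch.nn as nn

N_GESTURES = 6  # six gestures in the paper's dataset

def conv_block(in_ch: int) -> nn.Sequential:
    # Small convolutional feature extractor (hypothetical sizes).
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features regardless of F, T
    )

class MergedInputCNN(nn.Module):
    """Method 1: both channels rendered as one spectrogram image."""
    def __init__(self):
        super().__init__()
        self.features = conv_block(in_ch=1)
        self.head = nn.Linear(32 * 4 * 4, N_GESTURES)

    def forward(self, merged):               # (B, 1, F, T)
        return self.head(self.features(merged).flatten(1))

class EarlyFusionCNN(nn.Module):
    """Method 2: early fusion, spectrograms stacked as input channels."""
    def __init__(self):
        super().__init__()
        self.features = conv_block(in_ch=2)
        self.head = nn.Linear(32 * 4 * 4, N_GESTURES)

    def forward(self, left, right):          # each (B, 1, F, T)
        x = torch.cat([left, right], dim=1)  # (B, 2, F, T)
        return self.head(self.features(x).flatten(1))

class LateFusionCNN(nn.Module):
    """Method 3: late fusion, one convolutional branch per channel,
    merged by the final (fully connected) layers."""
    def __init__(self):
        super().__init__()
        self.branch_l = conv_block(in_ch=1)
        self.branch_r = conv_block(in_ch=1)
        self.head = nn.Linear(2 * 32 * 4 * 4, N_GESTURES)

    def forward(self, left, right):
        fl = self.branch_l(left).flatten(1)
        fr = self.branch_r(right).flatten(1)
        return self.head(torch.cat([fl, fr], dim=1))
```

Under these assumptions, EarlyFusionCNN()(left, right) and LateFusionCNN()(left, right) take per-channel spectrograms, while MergedInputCNN() takes the single merged spectrogram image; each returns logits over the six gesture classes.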
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that improves the representation of each modality by fusing them at different levels of the audio/visual encoders; a minimal sketch of this fusion pattern appears after this list.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Learning to Rank Microphones for Distant Speech Recognition [16.47293353050145]
Empirical evidence shows that being able to select the best microphone leads to significant improvements in recognition.
Current channel selection techniques either rely on signal, decoder or posterior-based features.
We propose MicRank, a learning to rank framework where a neural network is trained to rank the available channels.
arXiv Detail & Related papers (2021-04-06T22:39:30Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection [2.495606047371841]
We propose DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection.
We execute DeepMSRF by feeding it features of the two modalities, namely speakers' audio and face images.
The goal of DeepMSRF is to identify the gender of the speaker first, and further to recognize his or her name for any given video stream.
arXiv Detail & Related papers (2020-07-14T04:28:12Z)
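
For the MLCA-AVSR entry above, the fusion pattern (cross-attention applied at multiple encoder levels) can be illustrated with a short, hypothetical PyTorch sketch. The dimensions, depth, and the choice to inject visual context into the audio stream only are assumptions for illustration, not that paper's exact design.

```python
# Hypothetical sketch of multi-layer cross-attention fusion in the spirit of
# the MLCA-AVSR entry: audio and visual features are fused at several encoder
# depths. Dimensions, depth, and audio-side-only injection are assumptions.
import torch
import torch.nn as nn

class MultiLayerCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.visual_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])
        # One cross-attention block per encoder level: audio queries attend
        # to visual keys/values.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, dim); the time axes may differ.
        for a_layer, v_layer, xattn in zip(
                self.audio_layers, self.visual_layers, self.cross_attn):
            audio = a_layer(audio)
            visual = v_layer(visual)
            fused, _ = xattn(query=audio, key=visual, value=visual)
            audio = audio + fused  # inject visual context at this level
        return audio  # fused representation for a downstream ASR decoder
```

Fusing at every level, rather than only once at the end, is what distinguishes this pattern from the plain late fusion sketched earlier for the main paper.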