DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with
Feature selection
- URL: http://arxiv.org/abs/2007.06809v2
- Date: Tue, 21 Jul 2020 05:55:02 GMT
- Title: DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with
Feature selection
- Authors: Ehsan Asali, Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, Prasanth
Sengadu Suresh, and Hamid R. Arabnia
- Abstract summary: We propose DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection.
We execute DeepMSRF by feeding it features of the two modalities, namely the speakers' audio and face images.
The goal of DeepMSRF is first to identify the gender of the speaker, and then to recognize his or her name for any given video stream.
- Score: 2.495606047371841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For recognizing speakers in video streams, significant research has been
conducted to obtain rich machine learning models by extracting high-level
speaker features such as facial expression, emotion, and gender. However,
generating such a model is not feasible using only single-modality feature
extractors that exploit either the audio signals or the image frames extracted
from video streams. In this paper, we address this problem from a different
perspective and propose an unprecedented multimodal data fusion framework
called DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection. We
execute DeepMSRF by feeding it features of the two modalities, namely the
speakers' audio and face images. DeepMSRF uses a two-stream VGGNet trained on
both modalities to reach a comprehensive model capable of accurately
recognizing the speaker's identity. We apply DeepMSRF to a subset of the
VoxCeleb2 dataset with its metadata merged with the VGGFace2 dataset. The goal
of DeepMSRF is first to identify the gender of the speaker, and then to
recognize his or her name for any given video stream. The experimental results
illustrate that DeepMSRF outperforms single-modality speaker recognition
methods by at least 3 percent in accuracy.
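To make the two-stream pipeline described in the abstract concrete, below is a minimal PyTorch sketch assuming spectrogram-like audio inputs and face frames of the same spatial size. The class name, the concatenation-based fusion, and the two classifier heads (gender, then identity) are illustrative assumptions, not the DeepMSRF reference implementation.

```python
# Minimal sketch of the two-stream idea described in the abstract: one
# VGG-style stream per modality, concatenation-based fusion, and separate
# gender and identity heads. All names and design details are illustrative
# assumptions, not the DeepMSRF reference implementation.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class TwoStreamSpeakerNet(nn.Module):
    def __init__(self, num_speakers: int):
        super().__init__()
        # One VGG backbone per modality; the audio is assumed to arrive as a
        # 3-channel spectrogram image so the same backbone type can be reused.
        self.face_stream = vgg16(weights=None).features
        self.audio_stream = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        fused_dim = 2 * 512 * 7 * 7
        # Two heads: gender first, then speaker identity, mirroring the goal
        # stated in the abstract.
        self.gender_head = nn.Linear(fused_dim, 2)
        self.identity_head = nn.Linear(fused_dim, num_speakers)

    def forward(self, face_img: torch.Tensor, audio_spec: torch.Tensor):
        f = torch.flatten(self.pool(self.face_stream(face_img)), 1)
        a = torch.flatten(self.pool(self.audio_stream(audio_spec)), 1)
        fused = torch.cat([f, a], dim=1)  # simple concatenation fusion
        return self.gender_head(fused), self.identity_head(fused)


# Usage: a batch of 4 face frames and 4 spectrograms of matching size.
model = TwoStreamSpeakerNet(num_speakers=100)
gender_logits, id_logits = model(torch.randn(4, 3, 224, 224),
                                 torch.randn(4, 3, 224, 224))
```

The concatenation fusion here is the simplest possible choice; DeepMSRF's feature selection step, which decides which per-modality features to keep before fusion, is not reproduced in this sketch.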
Related papers
- MIS-AVoiDD: Modality Invariant and Specific Representation for
Audio-Visual Deepfake Detection [4.659427498118277]
A new kind of deepfake has emerged in which either the audio or the visual modality is manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z)
- TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection [11.27584658526063]
The second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances.
We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers.
Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
arXiv Detail & Related papers (2023-06-27T05:18:25Z)
- HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model.
To incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer (a minimal sketch of this fusion step is given after this list).
arXiv Detail & Related papers (2023-04-14T03:25:00Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information [21.527784717450885]
Speech Emotion Recognition aims to help machines understand humans' subjective emotions from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z)
- Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition [0.22843885788439797]
We propose a multi-modal Residual Perceptron Network (MRPN) that learns from multi-modal network branches, creating a deep feature representation with reduced noise.
With the proposed MRPN model and a novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate is improved to 91.4%.
arXiv Detail & Related papers (2021-07-21T13:11:37Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the two audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
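Several of the entries above fuse modalities with attention rather than concatenation. As a companion to the HCAM entry, below is a minimal PyTorch sketch of a co-attention fusion step over precomputed wav2vec-style audio embeddings and BERT-style text embeddings; the dimensions, names, and the final pooling/classifier are illustrative assumptions and do not reproduce the actual HCAM architecture.

```python
# Minimal sketch of co-attention fusion between audio and text token
# embeddings, in the spirit of the HCAM entry above. Dimensions, names, and
# the pooling/classification step are illustrative assumptions only.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, num_classes: int = 4):
        super().__init__()
        # Text queries attend over audio keys/values, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor):
        # audio_emb: (batch, audio_len, dim), e.g. wav2vec-style outputs
        # text_emb:  (batch, text_len, dim),  e.g. BERT-style outputs
        t_att, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)
        a_att, _ = self.audio_to_text(audio_emb, text_emb, text_emb)
        # Mean-pool each attended sequence and classify the utterance.
        fused = torch.cat([t_att.mean(dim=1), a_att.mean(dim=1)], dim=1)
        return self.classifier(fused)


# Usage: 2 utterances with 50 audio frames and 20 text tokens each.
fusion = CoAttentionFusion()
logits = fusion(torch.randn(2, 50, 768), torch.randn(2, 20, 768))
```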