Speaker Recognition in Realistic Scenario Using Multimodal Data
- URL: http://arxiv.org/abs/2302.13033v1
- Date: Sat, 25 Feb 2023 09:11:09 GMT
- Title: Speaker Recognition in Realistic Scenario Using Multimodal Data
- Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf
- Abstract summary: We propose a two-branch network to learn joint representations of faces and voices in a multimodal system.
We evaluate our proposed framework on a large-scale audio-visual dataset named VoxCeleb1.
- Score: 4.373374186532439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, an association between the faces and voices of
celebrities has been established by leveraging large-scale audio-visual
information from YouTube. The availability of large-scale audio-visual datasets
is instrumental in developing speaker recognition methods based on standard
Convolutional Neural Networks. The aim of this paper is therefore to leverage
large-scale audio-visual information to improve the speaker recognition task.
To achieve this, we propose a two-branch network to learn joint representations
of faces and voices in a multimodal system. Features extracted from the
two-branch network are then used to train a classifier for speaker recognition.
We evaluate the proposed framework on the large-scale audio-visual dataset
VoxCeleb1. Our results show that adding facial information improves speaker
recognition performance. Moreover, our results indicate that there is an
overlap between face and voice information.
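The abstract describes the method only at a high level; the following is a minimal sketch of what such a two-branch fusion network might look like, assuming pre-extracted face and voice embeddings rather than the CNN front-ends used in the paper. Layer sizes and names are illustrative; the classifier output size follows the 1,251 identities of VoxCeleb1.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Illustrative two-branch network: one branch per modality, concatenated
    into a joint face-voice representation, plus a speaker classification head."""

    def __init__(self, face_dim=512, voice_dim=192, joint_dim=256, num_speakers=1251):
        super().__init__()
        # Face branch: maps a face embedding into the joint space.
        self.face_branch = nn.Sequential(
            nn.Linear(face_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim)
        )
        # Voice branch: maps a voice embedding into the joint space.
        self.voice_branch = nn.Sequential(
            nn.Linear(voice_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim)
        )
        # Speaker classifier on the fused face-voice representation.
        self.classifier = nn.Linear(2 * joint_dim, num_speakers)

    def forward(self, face_emb, voice_emb):
        f = self.face_branch(face_emb)
        v = self.voice_branch(voice_emb)
        joint = torch.cat([f, v], dim=-1)     # joint face-voice representation
        return self.classifier(joint), joint  # logits and the fused feature


# Usage with random stand-in embeddings.
model = TwoBranchFusion()
logits, joint = model(torch.randn(8, 512), torch.randn(8, 192))
print(logits.shape)  # torch.Size([8, 1251])
```

In this sketch the classification head sits inside the module; the paper instead extracts features from the two-branch network and trains a separate classifier, but the overall flow (modality branches, joint representation, speaker classifier) is the same.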
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
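As a rough illustration of the late-fusion idea in the entry above (this is not the paper's EEND-EDA pipeline; the array shapes, fixed fusion weight, and 0.5 activity threshold are assumptions for the sketch):

```python
import numpy as np

def late_fuse(audio_probs, visual_probs, audio_weight=0.6):
    """Illustrative late fusion: weighted average of frame-level speaker-activity
    posteriors from an audio-only and a visual-centric sub-system.
    Shapes: (num_frames, num_speakers); speaker indices are assumed to be
    already aligned between the two sub-systems."""
    return audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs

# Toy example: 3 frames, 2 speakers.
audio = np.array([[0.9, 0.1], [0.2, 0.7], [0.5, 0.5]])
video = np.array([[0.8, 0.0], [0.1, 0.9], [0.6, 0.3]])
fused = late_fuse(audio, video)
active = fused > 0.5  # simple per-frame, per-speaker activity decision
```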
arXiv Detail & Related papers (2022-11-02T17:20:42Z)
- Audio-video fusion strategies for active speaker detection in meetings [5.61861182374067]
We propose two types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks.
For our application context, adding motion information greatly improves performance.
We have shown that attention-based fusion improves performance while reducing the standard deviation.
arXiv Detail & Related papers (2022-06-09T08:20:52Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
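A minimal sketch of the dynamic-stream-weight idea named in this entry, assuming per-frame log-likelihoods over a grid of candidate positions and externally estimated weights; the paper's region-wise weight estimation network is not reproduced here:

```python
import numpy as np

def fuse_log_likelihoods(audio_ll, video_ll, stream_weights):
    """Illustrative dynamic-stream-weight fusion for localization.
    audio_ll, video_ll: (num_frames, num_positions) log-likelihoods of
    candidate speaker positions from each modality.
    stream_weights: (num_frames, num_positions) values in [0, 1], estimated
    per frame (and, in the paper's extension, per spatial region)."""
    return stream_weights * audio_ll + (1.0 - stream_weights) * video_ll

rng = np.random.default_rng(0)
audio_ll = rng.normal(size=(5, 10))
video_ll = rng.normal(size=(5, 10))
weights = rng.uniform(size=(5, 10))
fused = fuse_log_likelihoods(audio_ll, video_ll, weights)
estimated_positions = fused.argmax(axis=1)  # most likely position per frame
```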
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
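The frequency-bin attention described in this entry can be illustrated with a small sketch; this is not the paper's exact FEFA design, and the bin count and scoring layer are assumptions:

```python
import torch
import torch.nn as nn

class FrequencyBinAttention(nn.Module):
    """Illustrative early attention over frequency bins: a small layer scores
    each bin of the input spectrogram, and the spectrogram is re-weighted
    before being passed to a downstream encoder."""

    def __init__(self, num_bins=257):
        super().__init__()
        self.score = nn.Linear(num_bins, num_bins)

    def forward(self, spec):                        # spec: (batch, frames, bins)
        weights = torch.sigmoid(self.score(spec))   # per-bin attention weights
        return spec * weights                       # emphasized/suppressed bins

attn = FrequencyBinAttention()
spec = torch.randn(4, 100, 257)  # toy batch of log-spectrograms
out = attn(spec)                 # same shape, frequency-reweighted
```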
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the representations of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors (linguistic content and speaker identity).
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from contextual information in the time and frequency domains.
The results show that the proposed approach, using speech enhancement and multi-stage attention, outperforms two strong baselines without these components under most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
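As a hedged sketch of the joint-optimisation idea in this last entry: a single mask-style enhancement layer and one attention stage stand in for the paper's multi-stage design, and the spectrogram dimensions, GRU embedding, and speaker count are assumptions.

```python
import torch
import torch.nn as nn

class JointEnhanceAndRecognize(nn.Module):
    """Illustrative joint framework: a simple enhancement front-end and a
    speaker classifier trained together, so the enhancement is optimized
    for recognition rather than in isolation."""

    def __init__(self, num_bins=257, emb_dim=128, num_speakers=1000):
        super().__init__()
        self.enhance = nn.Sequential(nn.Linear(num_bins, num_bins), nn.Sigmoid())  # masking layer
        self.attend = nn.Sequential(nn.Linear(num_bins, num_bins), nn.Sigmoid())   # time-frequency attention
        self.embed = nn.GRU(num_bins, emb_dim, batch_first=True)
        self.classify = nn.Linear(emb_dim, num_speakers)

    def forward(self, noisy_spec):                   # (batch, frames, bins)
        enhanced = noisy_spec * self.enhance(noisy_spec)
        attended = enhanced * self.attend(enhanced)  # highlight speaker-related bins
        _, h = self.embed(attended)
        return self.classify(h[-1]), enhanced

model = JointEnhanceAndRecognize()
logits, enhanced = model(torch.randn(2, 50, 257))
# A joint loss could combine a speaker-classification loss on `logits` with an
# enhancement loss between `enhanced` and a clean reference spectrogram.
```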