Multilingual Audio-Visual Smartphone Dataset And Evaluation
- URL: http://arxiv.org/abs/2109.04138v1
- Date: Thu, 9 Sep 2021 09:52:37 GMT
- Title: Multilingual Audio-Visual Smartphone Dataset And Evaluation
- Authors: Hareesh Mandalapu, Aravinda Reddy P N, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch
- Abstract summary: We present an audio-visual smartphone dataset captured with five different recent smartphones.
The dataset covers three different languages to address the language dependency of speaker recognition systems.
We also report the performance of benchmarked biometric verification systems on our dataset.
- Score: 35.82191448400655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Smartphones are increasingly paired with biometric verification systems to secure highly sensitive applications. Audio-visual biometrics are gaining popularity because they are convenient to use and, owing to their multi-modal nature, harder to spoof. In this work, we present an audio-visual smartphone dataset captured with five different recent smartphones. This new dataset contains 103 subjects recorded in three different sessions that reflect different real-world scenarios. The dataset covers three different languages to account for the language dependency of speaker recognition systems. These unique characteristics pave the way for novel state-of-the-art unimodal and audio-visual speaker recognition systems. We also report the performance of benchmarked biometric verification systems on our dataset. Extensive experiments evaluate the robustness of the biometric algorithms against multiple dependencies, such as signal noise, device, and language, as well as presentation attacks such as replay and synthesized signals. The results raise serious concerns about the generalization properties of state-of-the-art biometric methods on smartphones.
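The abstract reports benchmarked verification performance; such evaluations typically use error rates like the Equal Error Rate (EER), the operating point where the false accept rate (FAR) and false reject rate (FRR) coincide. As a rough, generic illustration (not the authors' actual protocol; the score distributions below are made up), this Python sketch estimates the EER from genuine and impostor similarity scores:

```python
import numpy as np

def compute_eer(genuine_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    """Estimate the Equal Error Rate by sweeping a decision threshold
    over all observed scores and finding where FAR and FRR cross."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # genuine trials rejected
        far = np.mean(impostor_scores >= t)  # impostor trials accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

# Hypothetical score distributions, for illustration only.
rng = np.random.default_rng(0)
genuine = rng.normal(loc=0.7, scale=0.1, size=1000)
impostor = rng.normal(loc=0.4, scale=0.1, size=1000)
print(f"EER: {compute_eer(genuine, impostor):.3f}")
```

A lower EER indicates better-separated genuine and impostor scores; robustness studies like the one above typically compare EERs across devices, languages, noise levels, and attack conditions.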
Related papers
- Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data.
We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities?
Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios?
Third, what strategies can be employed to curate and obtain high-quality, real-life scenario...
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Developing Acoustic Models for Automatic Speech Recognition in Swedish [6.5458610824731664]
This paper is concerned with automatic continuous speech recognition using trainable systems.
The aim of this work is to build acoustic models for spoken Swedish.
arXiv Detail & Related papers (2024-04-25T12:03:14Z)
- Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine which information is encoded in an automatic speech recognition acoustic model (AM), and where it is located.
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection systems and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
arXiv Detail & Related papers (2024-02-29T18:43:53Z)
- Language identification as improvement for lip-based biometric visual systems [13.205817167773443]
We present a preliminary study in which we use linguistic information as a soft biometric trait to enhance the performance of a visual (auditory-free) identification system based on lip movement.
We report a significant improvement in the identification performance of the proposed visual system as a result of the integration of these data.
arXiv Detail & Related papers (2023-02-27T15:44:24Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Mobile Behavioral Biometrics for Passive Authentication [65.94403066225384]
This work carries out a comparative analysis of unimodal and multimodal behavioral biometric traits.
Experiments are performed over HuMIdb, one of the largest and most comprehensive freely available mobile user interaction databases.
In our experiments, the most discriminative background sensor is the magnetometer, whereas among touch tasks the best results are achieved with keystroke.
arXiv Detail & Related papers (2022-03-14T17:05:59Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments [7.286387368812729]
This paper presents a publicly available, phonetically transcribed corpus of 2255 utterances in the endangered Tangkhulic language East Tusom.
Because the dataset is transcribed in terms of phones, rather than phonemes, it is a better match for universal phone recognition systems than many larger datasets.
arXiv Detail & Related papers (2021-04-02T00:26:10Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)
- Few Shot Text-Independent speaker verification using 3D-CNN [0.0]
We propose a novel method to verify the identity of a claimed speaker using very little training data.
Experiments conducted on the VoxCeleb1 dataset show that, even when trained with very little data, the proposed model achieves accuracy close to state-of-the-art text-independent speaker verification models.
arXiv Detail & Related papers (2020-08-25T15:03:29Z)
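The few-shot verification summary above does not spell out a scoring pipeline, so here is a generic sketch of how text-independent speaker verification trials are commonly scored: average a handful of enrollment embeddings and compare them against a test embedding with cosine similarity. The embedding dimension, data, and threshold below are all hypothetical, and the encoder (such as the paper's 3D-CNN) is assumed to exist elsewhere:

```python
import numpy as np

def cosine_score(enroll_embeddings: np.ndarray, test_embedding: np.ndarray) -> float:
    """Score a verification trial as the cosine similarity between the
    mean of the enrollment embeddings and the test embedding."""
    enroll = enroll_embeddings.mean(axis=0)
    enroll = enroll / np.linalg.norm(enroll)
    test = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(enroll, test))

# Hypothetical 256-dim embeddings, e.g. produced by a speaker encoder
# such as the paper's 3D-CNN (not implemented here).
rng = np.random.default_rng(1)
enroll = rng.normal(size=(3, 256))  # three enrollment utterances
test = rng.normal(size=256)         # one test utterance
threshold = 0.5  # would be tuned on a development set in practice
print("accept" if cosine_score(enroll, test) >= threshold else "reject")
```

In a few-shot setting, the appeal of this scheme is that enrolling a new speaker requires only forward passes through a fixed encoder, with no per-speaker retraining.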