Listening for Expert Identified Linguistic Features: Assessment of Audio Deepfake Discernment among Undergraduate Students
- URL: http://arxiv.org/abs/2411.14586v1
- Date: Thu, 21 Nov 2024 20:52:02 GMT
- Title: Listening for Expert Identified Linguistic Features: Assessment of Audio Deepfake Discernment among Undergraduate Students
- Authors: Noshaba N. Bhalli, Nehal Naqvi, Chloe Evered, Christine Mallinson, Vandana P. Janeja
- Abstract summary: This paper evaluates the impact of training undergraduate students to improve their audio deepfake discernment ability by listening for expert-defined linguistic features.
Our research goes beyond informational training by introducing targeted linguistic cues to listeners as a deepfake discernment mechanism.
Findings show that the experimental group had a statistically significant decrease in unsurety when evaluating audio clips and an improvement in their ability to correctly identify clips they were initially unsure about.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper evaluates the impact of training undergraduate students to improve their audio deepfake discernment ability by listening for expert-defined linguistic features. Such features have been shown to improve performance of AI algorithms; here, we ascertain whether this improvement in AI algorithms also translates to improvement of the perceptual awareness and discernment ability of listeners. With humans as the weakest link in any cybersecurity solution, we propose that listener discernment is a key factor for improving trustworthiness of audio content. In this study we determine whether training that familiarizes listeners with English language variation can improve their abilities to discern audio deepfakes. We focus on undergraduate students, as this demographic group is constantly exposed to social media and the potential for deception and misinformation online. To the best of our knowledge, our work is the first study to uniquely address English audio deepfake discernment through such techniques. Our research goes beyond informational training by introducing targeted linguistic cues to listeners as a deepfake discernment mechanism, via a training module. In a pre-/post- experimental design, we evaluated the impact of the training across 264 students as a representative cross section of all students at the University of Maryland, Baltimore County, and across experimental and control sections. Findings show that the experimental group showed a statistically significant decrease in their unsurety when evaluating audio clips and an improvement in their ability to correctly identify clips they were initially unsure about. While results are promising, future research will explore more robust and comprehensive trainings for greater impact.
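To make the pre-/post- comparison concrete, here is a minimal sketch of a paired analysis of "unsure" responses. The per-student data are simulated, and the 20-clip session length and the choice of a Wilcoxon signed-rank test are illustrative assumptions, not the paper's published analysis.

```python
# Hypothetical sketch: paired pre-/post- comparison of "unsure" response counts.
# All data here are simulated; the study's actual responses and analysis are not public.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_students, n_clips = 264, 20            # 264 students (from the abstract); 20 clips is assumed

# Simulated per-student counts of "unsure" ratings before and after the training module.
pre_unsure = rng.binomial(n_clips, 0.35, size=n_students)
post_unsure = rng.binomial(n_clips, 0.25, size=n_students)

# Paired, non-parametric test of whether unsurety decreased after training.
stat, p_value = wilcoxon(pre_unsure, post_unsure, alternative="greater")
print(f"median unsure (pre/post): {np.median(pre_unsure)} / {np.median(post_unsure)}")
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p_value:.4f}")
```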
Related papers
- Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses [0.08155575318208629]
Speech anonymization needs to obscure a speaker's identity while retaining critical information for subsequent tasks.
Our research underscores the importance of loss functions inspired by the human auditory system.
Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations.
arXiv Detail & Related papers (2024-10-20T20:33:44Z)
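As a rough illustration of a perception-informed, model-agnostic loss of the kind this entry describes, the sketch below combines a handcrafted log-spectrogram distance with a distance in the feature space of a placeholder encoder. The specific terms, weights, and the ToyEncoder are assumptions, not the paper's losses.

```python
# Minimal sketch of a perception-informed loss: handcrafted + deep-feature terms.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Placeholder standing in for a pretrained auditory/deep feature extractor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                                 nn.Conv1d(16, 32, 9, stride=4))
    def forward(self, wav):                      # (batch, time) -> (batch, 32, frames)
        return self.net(wav.unsqueeze(1))

class PerceptionInformedLoss(nn.Module):
    def __init__(self, feature_net: nn.Module, alpha: float = 0.5, n_fft: int = 512):
        super().__init__()
        self.feature_net = feature_net           # deep-feature term
        self.alpha = alpha                       # balance between the two terms
        self.n_fft = n_fft

    def log_spec(self, wav):
        # Handcrafted term: log-magnitude spectrogram.
        spec = torch.stft(wav, n_fft=self.n_fft, hop_length=self.n_fft // 4,
                          window=torch.hann_window(self.n_fft), return_complex=True)
        return torch.log1p(spec.abs())

    def forward(self, anonymized, reference):
        spec_loss = nn.functional.l1_loss(self.log_spec(anonymized), self.log_spec(reference))
        deep_loss = nn.functional.l1_loss(self.feature_net(anonymized), self.feature_net(reference))
        return self.alpha * spec_loss + (1 - self.alpha) * deep_loss

loss_fn = PerceptionInformedLoss(feature_net=ToyEncoder())
anonymized, reference = torch.randn(2, 16000), torch.randn(2, 16000)   # 1 s clips at 16 kHz
print(loss_fn(anonymized, reference).item())
```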
- Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning on the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
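A minimal sketch of the transfer-learning setup this entry alludes to: a frozen, pre-trained speech encoder provides frame-level features, and only a small regression head is trained on continuous satisfaction labels. The placeholder encoder, feature dimension, and mean pooling are assumptions.

```python
# Frozen pre-trained encoder + trainable regression head (toy stand-ins throughout).
import torch
import torch.nn as nn

class FrozenEncoderRegressor(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, 1)       # only this head is trained

    def forward(self, wav):
        with torch.no_grad():
            feats = self.encoder(wav.unsqueeze(1))   # (batch, feat_dim, frames)
        pooled = feats.mean(dim=-1)                  # temporal mean pooling
        return self.head(pooled).squeeze(-1)         # continuous satisfaction estimate

encoder = nn.Sequential(nn.Conv1d(1, 64, 10, stride=5), nn.ReLU(),
                        nn.Conv1d(64, 64, 8, stride=4))   # placeholder "pre-trained" model
model = FrozenEncoderRegressor(encoder, feat_dim=64)
wav = torch.randn(4, 16000)                              # four 1-second clips at 16 kHz
target = torch.rand(4)                                   # satisfaction labels in [0, 1]
loss = nn.functional.mse_loss(model(wav), target)
loss.backward()                                          # gradients flow only into the head
```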
- Learning in Audio-visual Context: A Review, Analysis, and New Perspective [88.40519011197144]
This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z)
- Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? [86.53044183309824]
We study which factor leads to the success of self-supervised learning on speaker-related tasks.
Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the speaker verification (SV) task comes from a combination of the masked speech prediction loss, data scale, and model size.
arXiv Detail & Related papers (2022-04-27T08:35:57Z)
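For context, a toy sketch of a masked speech prediction objective of the kind named above: random frames of a feature sequence are hidden, a small network predicts them, and the loss is computed only on the masked positions. The shapes, the zero-out masking, and the MLP context network are illustrative assumptions.

```python
# Masked frame prediction on a synthetic feature sequence.
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, frames, dim = 8, 100, 80            # e.g. 80-dim log-mel frames (assumed)
features = torch.randn(batch, frames, dim)

mask = torch.rand(batch, frames) < 0.15    # hide ~15% of frames
masked_input = features.clone()
masked_input[mask] = 0.0                   # simple zero-out masking

context_net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))
reconstruction = context_net(masked_input)

# Masked prediction loss: only the hidden frames contribute.
loss = nn.functional.l1_loss(reconstruction[mask], features[mask])
loss.backward()
print(f"masked frames: {int(mask.sum())}, loss: {loss.item():.3f}")
```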
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
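A hedged sketch of the "choose the correct masked-out snippet" idea: the representation predicted at a MASK position is scored against in-batch candidate snippet embeddings with a contrastive cross-entropy. The toy tensors, dimensions, and temperature are assumptions, not the MERLOT Reserve architecture.

```python
# Contrastive selection of the masked-out snippet among in-batch candidates.
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, dim = 16, 128
predicted_at_mask = torch.randn(batch, dim, requires_grad=True)  # model output at the MASK position
candidate_snippets = torch.randn(batch, dim)                     # encoded true snippets, one per item

# Cosine-similarity logits between each MASK prediction and every candidate in the batch.
pred = nn.functional.normalize(predicted_at_mask, dim=-1)
cand = nn.functional.normalize(candidate_snippets, dim=-1)
logits = pred @ cand.t() / 0.07                                  # temperature is an assumption

# The correct snippet for item i is candidate i -> standard contrastive cross-entropy.
targets = torch.arange(batch)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```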
- Estimating Presentation Competence using Multimodal Nonverbal Behavioral Cues [7.340483819263093]
Public speaking and presentation competence play an essential role in many areas of social interaction.
One approach that can promote efficient development of presentation competence is the automated analysis of human behavior during a speech.
In this work, we investigate the contribution of different nonverbal behavioral cues, namely, facial, body pose-based, and audio-related features, to estimate presentation competence.
arXiv Detail & Related papers (2021-05-06T13:09:41Z)
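As a simple illustration of fusing such nonverbal cues, the sketch below concatenates per-clip facial, body-pose, and audio feature vectors and regresses a competence score. The feature dimensions and the small fusion network are made-up placeholders, not the paper's model.

```python
# Late fusion of per-modality feature vectors into a scalar competence estimate.
import torch
import torch.nn as nn

face_dim, pose_dim, audio_dim = 64, 32, 40
fusion_model = nn.Sequential(
    nn.Linear(face_dim + pose_dim + audio_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),                        # scalar presentation-competence score
)

face = torch.randn(8, face_dim)              # e.g. aggregated facial features
pose = torch.randn(8, pose_dim)              # e.g. body-pose movement statistics
audio = torch.randn(8, audio_dim)            # e.g. prosodic / spectral statistics
score = fusion_model(torch.cat([face, pose, audio], dim=-1)).squeeze(-1)
print(score.shape)                           # torch.Size([8])
```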
- Improving Fairness in Speaker Recognition [4.94706680113206]
We investigate the disparity in performance achieved by state-of-the-art deep speaker recognition systems across demographic groups.
We show that models trained with demographically-balanced training sets exhibit a fairer behavior on different groups, while still being accurate.
arXiv Detail & Related papers (2021-04-29T01:08:53Z)
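One common way to approximate a demographically balanced training set, in the spirit of the entry above, is inverse group-frequency sampling. The sketch below uses PyTorch's WeightedRandomSampler on synthetic group labels; it is not the paper's exact procedure.

```python
# Balance demographic groups by sampling utterances with inverse group-frequency weights.
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Synthetic demographic group label per training utterance (0-3 = four groups, assumed).
group = torch.randint(0, 4, (10_000,))
counts = Counter(group.tolist())
weights = torch.tensor([1.0 / counts[g] for g in group.tolist()])

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# Pass `sampler=sampler` to a DataLoader over the speaker-recognition training set;
# each epoch then draws the groups with roughly equal probability.
drawn = group[torch.tensor(list(sampler))]
print(Counter(drawn.tolist()))               # roughly uniform across the four groups
```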
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
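As a minimal example of the deep-learning approaches such overviews cover (a sketch of the common time-frequency masking formulation, not the survey's own method), an untrained placeholder network predicts a mask over the noisy spectrogram, which is applied before the inverse STFT.

```python
# Time-frequency masking sketch for speech enhancement with a placeholder mask network.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
noisy = torch.randn(1, 16000)                               # 1 second of "noisy" audio
window = torch.hann_window(n_fft)

spec = torch.stft(noisy, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
mag = spec.abs()                                            # (1, freq, frames)

n_bins = n_fft // 2 + 1
mask_net = nn.Sequential(nn.Linear(n_bins, n_bins), nn.Sigmoid())   # per-frame mask in [0, 1]
mask = mask_net(mag.transpose(1, 2)).transpose(1, 2)        # back to (1, freq, frames)

enhanced_spec = spec * mask                                 # masked complex spectrogram
enhanced = torch.istft(enhanced_spec, n_fft=n_fft, hop_length=hop, window=window)
print(enhanced.shape)                                       # roughly (1, 16000)
```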
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
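A hypothetical sketch of the joint objective this entry describes: a shared raw-audio encoder feeds one head that predicts an informative audio attribute and another that generates a face image, and the two self-supervised losses are summed. Every module, the log-energy attribute, and the 32x32 "face" resolution are toy stand-ins, not the paper's architecture.

```python
# Joint audio-only + visual self-supervision from a shared raw-audio encoder.
import torch
import torch.nn as nn

class JointSelfSupervision(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv1d(1, 32, 10, stride=5), nn.ReLU(),
                                     nn.Conv1d(32, 64, 8, stride=4), nn.AdaptiveAvgPool1d(1))
        self.attr_head = nn.Linear(64, 1)                   # audio-only: predict an attribute
        self.face_head = nn.Linear(64, 32 * 32)             # visual: generate a tiny face image

    def forward(self, wav):
        z = self.encoder(wav.unsqueeze(1)).squeeze(-1)      # (batch, 64) clip embedding
        return self.attr_head(z).squeeze(-1), self.face_head(z).view(-1, 1, 32, 32)

model = JointSelfSupervision()
wav = torch.randn(4, 16000)
target_attr = wav.pow(2).mean(dim=-1).log()                 # log energy as the assumed attribute
target_face = torch.rand(4, 1, 32, 32)                      # frames of the talking face
pred_attr, pred_face = model(wav)
loss = (nn.functional.mse_loss(pred_attr, target_attr)
        + nn.functional.l1_loss(pred_face, target_face))    # summed multi-task loss
loss.backward()
```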
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
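A hedged sketch of the kind of downstream evaluation this entry mentions: the learned audio representations are frozen and a simple linear probe is fit for discrete emotion recognition. The synthetic 256-dimensional embeddings and six emotion classes are assumptions for illustration only.

```python
# Linear-probe evaluation of frozen audio representations for discrete emotion recognition.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 256))       # pooled clip-level embeddings (assumed 256-d)
labels = rng.integers(0, 6, size=500)        # six discrete emotion classes (assumed)

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print(f"linear-probe accuracy: {probe.score(x_te, y_te):.2f}")   # ~chance on random data
```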
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.