Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research
- URL: http://arxiv.org/abs/2305.01965v1
- Date: Wed, 3 May 2023 08:25:37 GMT
- Title: Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research
- Authors: María Andrea Cruz Blandón, Alejandrina Cristia, Okko Räsänen
- Abstract summary: Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modelling of early language acquisition aims to understand how infants
bootstrap their language skills. The modelling encompasses properties of the
input data used for training the models, the cognitive hypotheses and their
algorithmic implementations being tested, and the evaluation methodologies to
compare models to human data. Recent developments have enabled the use of more
naturalistic training data for computational models. This also motivates
development of more naturalistic tests of model behaviour. A crucial step
towards such an aim is to develop representative speech datasets consisting of
speech heard by infants in their natural environments. However, a major
drawback of such recordings is that they are typically noisy, and it is
currently unclear how the sound quality could affect analyses and modelling
experiments conducted on such data. In this paper, we explore this aspect for
the case of infant-directed speech (IDS) and adult-directed speech (ADS)
analysis. First, we manually and automatically annotated audio quality of
utterances extracted from two corpora of child-centred long-form recordings (in
English and French). We then compared acoustic features of IDS and ADS in an
in-lab dataset and across different audio quality subsets of naturalistic data.
Finally, we assessed how the audio quality and recording environment may change
the conclusions of a modelling analysis using a recent self-supervised learning
model. Our results show that modest- and high-quality naturalistic speech data
yield largely similar conclusions on IDS and ADS in terms of both acoustic
analyses and modelling experiments. We also found that an automatic sound
quality assessment tool can be used to select the useful parts of long-form
recordings for closer analysis, with results comparable to those of manual
quality annotation.
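To make the screening step concrete, here is a minimal sketch of thresholding utterances by an automatically estimated signal-to-noise ratio. The `estimate_snr_db` heuristic, the 15 dB threshold, and the file-based interface are illustrative assumptions; the paper itself used a dedicated automatic quality-assessment tool (alongside manual annotation), not this stand-in.

```python
import numpy as np
import soundfile as sf

def estimate_snr_db(waveform, frame_len=2048):
    """Crude SNR proxy: compare loud (speech-like) frames to quiet (noise-like) frames."""
    n_frames = len(waveform) // frame_len
    if n_frames < 2:
        return float("-inf")  # too short to judge; treat as unusable
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1) + 1e-12
    noise_floor = np.percentile(energies, 10)   # quietest frames approximate noise
    signal_level = np.percentile(energies, 90)  # loudest frames approximate speech
    return 10.0 * np.log10(signal_level / noise_floor)

def screen_utterances(paths, snr_threshold_db=15.0):
    """Keep only utterances whose estimated quality exceeds the (assumed) threshold."""
    kept = []
    for path in paths:
        waveform, _sr = sf.read(path)
        if waveform.ndim > 1:          # mix stereo long-form recordings down to mono
            waveform = waveform.mean(axis=1)
        if estimate_snr_db(waveform) >= snr_threshold_db:
            kept.append(path)
    return kept
```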
Related papers
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR aims to provide a comprehensive evaluation benchmark for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Measuring Sound Symbolism in Audio-visual Models [21.876743976994614]
This study investigates whether pre-trained audio-visual models demonstrate associations between sounds and visual representations.
Our findings reveal connections to human language processing, providing insights into cognitive architectures and machine learning strategies.
arXiv Detail & Related papers (2024-09-18T20:33:54Z)
- Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders [0.8796261172196743]
We train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection.
We apply the attention rollout method to produce model relevance maps, which show the computed relevance of spectrogram regions when the model makes predictions.
We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is fine-tuned (a minimal sketch of the rollout computation follows).
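A minimal NumPy sketch of the attention rollout computation referenced above, following the standard formulation (average over heads, add the identity for residual connections, re-normalize, multiply through the layers); the input format is an assumption:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer arrays of shape (heads, tokens, tokens)."""
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)            # average attention over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)   # account for residual connections
        a = a / a.sum(axis=-1, keepdims=True)  # re-normalize rows to sum to 1
        rollout = a @ rollout                  # propagate through the layer stack
    return rollout  # rollout[i, j]: estimated influence of input token j on token i
```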
arXiv Detail & Related papers (2024-06-29T21:14:48Z)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study [33.10311742703679]
We make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM.
Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios.
We benchmark the sound generation task on various frequently-used datasets.
arXiv Detail & Related papers (2023-03-07T12:49:45Z)
- Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition [11.489161072526677]
We investigate the robustness properties of pre-trained neural models for automatic speech recognition, analysing wav2vec2, HuBERT, and DistilHuBERT on the LibriSpeech and TIMIT datasets.
arXiv Detail & Related papers (2022-08-17T20:00:54Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
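As a rough sketch of the listener-dependent inference idea, the snippet below contrasts two strategies of the kind the entry alludes to: averaging predictions over all known listener embeddings versus a single pass with a mean-listener embedding. Every name, shape, and the toy scorer are illustrative assumptions, not LDNet's actual implementation.

```python
import numpy as np

def predict_mos(system_features, listener_embedding, weights):
    """Hypothetical listener-dependent scorer: maps (features, listener) to a score in [1, 5]."""
    x = np.concatenate([system_features, listener_embedding])
    return 1.0 + 4.0 / (1.0 + np.exp(-(x @ weights)))  # squash the score into the MOS range

def mos_all_listeners(system_features, listener_embeddings, weights):
    """Inference option 1: average the predictions over every known listener (more stable)."""
    return float(np.mean([predict_mos(system_features, e, weights)
                          for e in listener_embeddings]))

def mos_mean_listener(system_features, listener_embeddings, weights):
    """Inference option 2: one forward pass with the mean listener embedding (more efficient)."""
    return float(predict_mos(system_features, listener_embeddings.mean(axis=0), weights))

# Toy usage with random stand-ins for a trained model's parameters:
rng = np.random.default_rng(0)
feats = rng.normal(size=16)           # features of one synthetic-speech utterance
listeners = rng.normal(size=(8, 4))   # embeddings for eight human raters
w = rng.normal(size=20)               # 16 feature dims + 4 listener dims
print(mos_all_listeners(feats, listeners, w), mos_mean_listener(feats, listeners, w))
```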
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Generación de voces artificiales infantiles en castellano con acento costarricense (Generation of artificial children's voices in Castilian Spanish with a Costa Rican accent) [0.0]
This article evaluates a first experience of generating artificial children's voices with a Costa Rican accent.
Results show that the intelligibility of the synthesized voices, evaluated on isolated words, is lower than that of the voices recorded by the participating children.
arXiv Detail & Related papers (2021-02-02T02:12:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.