Low-dimensional representation of infant and adult vocalization acoustics
- URL: http://arxiv.org/abs/2204.12279v1
- Date: Mon, 25 Apr 2022 17:58:13 GMT
- Title: Low-dimensional representation of infant and adult vocalization acoustics
- Authors: Silvia Pagliarini, Sara Schneider, Christopher T. Kello, Anne S. Warlaumont
- Abstract summary: We use spectral feature extraction and unsupervised machine learning, specifically Uniform Manifold Approximation and Projection (UMAP), to obtain a novel 2-dimensional spatial representation of infant and caregiver vocalizations extracted from day-long home recordings.
For instance, we found that the dispersion of infant vocalization acoustics within the 2-D space over a day increased from 3 to 9 months, and then decreased from 9 to 18 months.
- Score: 2.1826796927092214
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: During the first years of life, infant vocalizations change considerably, as
infants develop the vocalization skills that enable them to produce speech
sounds. Characterizations based on specific acoustic features, protophone
categories, or phonetic transcription are able to provide a representation of
the sounds infants make at different ages and in different contexts but do not
fully describe how sounds are perceived by listeners, can be inefficient to
obtain at large scales, and are difficult to visualize in two dimensions
without additional statistical processing. Machine-learning-based approaches
provide the opportunity to complement these characterizations with purely
data-driven representations of infant sounds. Here, we use spectral feature
extraction and unsupervised machine learning, specifically Uniform Manifold
Approximation and Projection (UMAP), to obtain a novel 2-dimensional spatial
representation of
infant and caregiver vocalizations extracted from day-long home recordings.
UMAP yields a continuous and well-distributed space conducive to certain
analyses of infant vocal development. For instance, we found that the
dispersion of infant vocalization acoustics within the 2-D space over a day
increased from 3 to 9 months, and then decreased from 9 to 18 months. The
method also permits analysis of the similarity between infant and adult
vocalizations, which likewise changes with infant age.
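As a rough illustration, the kind of pipeline the abstract describes can be sketched in a few lines of Python. The feature choice (time-averaged log-mel spectrograms), the UMAP settings, and the `vocalization_paths` / `is_infant` placeholders below are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the pipeline described in the abstract (illustrative
# assumptions throughout; not the authors' exact features or settings).
import numpy as np
import librosa
import umap  # pip install umap-learn

def spectral_features(wav_path, sr=16000, n_mels=64):
    # One feature vector per clip: time-averaged log-mel spectrogram.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).mean(axis=1)

# `vocalization_paths` and `is_infant` are hypothetical: paths to segmented
# vocalizations and a boolean mask marking infant (vs. adult) segments.
X = np.stack([spectral_features(p) for p in vocalization_paths])
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Dispersion of a set of points (e.g., one day's vocalizations):
# mean distance to their centroid.
def dispersion(points):
    return np.linalg.norm(points - points.mean(axis=0), axis=1).mean()

# One possible infant-adult similarity measure: distance between the
# two groups' centroids in the embedding space.
infant_adult_distance = np.linalg.norm(
    embedding[is_infant].mean(axis=0) - embedding[~is_infant].mean(axis=0)
)
```

Computing `dispersion` separately per recording day would support the kind of age comparison reported above (rising from 3 to 9 months, then falling to 18 months).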
Related papers
- Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations [0.0]
Based on audio recordings made once a month during the first 12 months of a child's life, we propose a new method for clustering this set of vocalizations.
We use a topologically augmented representation of the vocalizations, employing two persistence diagrams for each vocalization.
Our findings reveal the presence of 8 clusters of vocalizations, allowing us to compare their temporal distribution and acoustic profiles in the first 12 months of life.
arXiv Detail & Related papers (2024-07-08T09:12:52Z)
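For a sense of what the topologically augmented representation in the entry above might look like, here is a hedged sketch using ripser.py to derive two persistence diagrams (H0 and H1) from a vocalization's spectrogram. The paper's exact construction may differ, and the filename and subsampling step are hypothetical.

```python
# Illustrative sketch only: two persistence diagrams per vocalization,
# computed from its spectrogram treated as a point cloud.
import librosa
from ripser import ripser  # pip install ripser

y, sr = librosa.load("vocalization.wav", sr=16000)  # hypothetical file
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

# Subsample spectrogram frames for tractability, then compute H0 and H1
# diagrams -- the "two persistence diagrams" mentioned in the summary.
points = mel.T[::10]
dgms = ripser(points, maxdim=1)["dgms"]
h0_diagram, h1_diagram = dgms[0], dgms[1]
```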
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
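The de-duplication step mentioned in the entry above is simple to illustrate: collapse consecutive repeats of the same discrete unit. A minimal sketch follows; subword modeling (e.g., BPE over the unit vocabulary) would then compress the de-duplicated sequence further.

```python
from itertools import groupby

def deduplicate(units):
    # Collapse consecutive repeats of the same discrete speech unit,
    # e.g. [5, 5, 5, 12, 12, 7] -> [5, 12, 7].
    return [u for u, _ in groupby(units)]

print(deduplicate([5, 5, 5, 12, 12, 7]))  # [5, 12, 7]
```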
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system produces higher-quality singing voices, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform are a promising candidate for such a model.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Visualizations of Complex Sequences of Family-Infant Vocalizations Using Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features [41.07344746812834]
In the U.S., approximately 15-17% of children 2-8 years of age are estimated to have at least one diagnosed mental, behavioral or developmental disorder.
Previous studies have shown advanced ML models excel at classifying infant and/or parent vocalizations collected using cell phones, video, or audio-only recording devices like LENA.
We use a bag-of-audio-words method with wav2vec 2.0 features to create high-level visualizations to understand family-infant vocalization interactions.
arXiv Detail & Related papers (2022-03-29T01:46:14Z)
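A bag-of-audio-words pipeline over wav2vec 2.0 frame features, as in the entry above, can be sketched as follows. The checkpoint, codebook size, and the `training_waveforms` placeholder are assumptions, not the paper's exact configuration.

```python
# Sketch of a bag-of-audio-words pipeline over wav2vec 2.0 frame features
# (checkpoint and codebook size are illustrative assumptions).
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def frame_features(waveform, sr=16000):
    # One wav2vec 2.0 hidden vector per ~20 ms frame.
    inputs = fe(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state.squeeze(0).numpy()

# 1) Learn a codebook of "audio words" by clustering pooled frame features.
#    `training_waveforms` is a hypothetical list of 1-D audio arrays.
all_frames = np.vstack([frame_features(w) for w in training_waveforms])
codebook = KMeans(n_clusters=64, random_state=0).fit(all_frames)

# 2) Represent each recording as a normalized histogram of codeword counts.
def bag_of_audio_words(waveform):
    words = codebook.predict(frame_features(waveform))
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()
```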
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between theories from cognitive psychology and our modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit [3.7373314439051106]
Hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia.
We introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional speech emotion recognition (SER) system for the Finnish subset of the data.
We show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall.
arXiv Detail & Related papers (2021-06-14T11:17:52Z)
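Unweighted average recall (UAR), the metric reported in the entry above, is recall averaged equally over classes, i.e. macro-averaged recall. A toy example with scikit-learn (the labels are purely illustrative):

```python
from sklearn.metrics import recall_score

# UAR = recall averaged equally over classes (macro recall).
y_true = ["neutral", "positive", "negative", "neutral", "positive"]
y_pred = ["neutral", "positive", "neutral", "neutral", "negative"]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```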
- Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image [10.100437437151621]
We explore the feasibility of age estimation from ultrasound tongue images of the speakers.
Motivated by the success of deep learning, this paper applies deep learning to this task.
The developed method can be used as a tool to evaluate the performance of speech therapy sessions.
arXiv Detail & Related papers (2021-01-27T08:00:47Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Data-driven Detection and Analysis of the Patterns of Creaky Voice [13.829936505895692]
Creaky voice is a voice quality frequently used as a phrase-boundary marker.
The automatic detection and modelling of creaky voice may have implications for speech technology applications.
arXiv Detail & Related papers (2020-05-31T13:34:30Z)