Visualizations of Complex Sequences of Family-Infant Vocalizations Using
Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features
- URL: http://arxiv.org/abs/2203.15183v1
- Date: Tue, 29 Mar 2022 01:46:14 GMT
- Title: Visualizations of Complex Sequences of Family-Infant Vocalizations Using
Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features
- Authors: Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain
- Abstract summary: In the U.S., approximately 15-17% of children 2-8 years of age are estimated to have at least one diagnosed mental, behavioral or developmental disorder.
Previous studies have shown that advanced ML models excel at classifying infant and/or parent vocalizations collected using cell phone, video, or audio-only recording devices such as LENA.
We use a bag-of-audio-words method with wav2vec 2.0 features to create high-level visualizations to understand family-infant vocalization interactions.
- Score: 41.07344746812834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the U.S., approximately 15-17% of children 2-8 years of age are estimated
to have at least one diagnosed mental, behavioral or developmental disorder.
However, such disorders often go undiagnosed, and the ability to evaluate and
treat disorders in the first years of life is limited. To analyze infant
developmental changes, previous studies have shown that advanced ML models excel
at classifying infant and/or parent vocalizations collected using cell phone,
video, or audio-only recording devices such as LENA. In this study, we pilot test
the audio component of a new infant wearable multi-modal device that we have
developed called LittleBeats (LB). The LB audio pipeline is advanced in that it
provides reliable labels for both speaker diarization and vocalization
classification tasks, compared with other platforms that only record audio
and/or provide speaker diarization labels. We leverage wav2vec 2.0 to obtain
superior and more nuanced results with the LB family audio stream. We use a
bag-of-audio-words method with wav2vec 2.0 features to create high-level
visualizations to understand family-infant vocalization interactions. We
demonstrate that our high-quality visualizations capture major types of family
vocalization interactions, in categories indicative of mental, behavioral, and
developmental health, for both labeled and unlabeled LB audio.
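To make the bag-of-audio-words idea concrete, below is a minimal Python sketch, not the authors' exact pipeline: frame-level wav2vec 2.0 features are clustered into a small codebook of "audio words," each vocalization is summarized as a normalized codeword histogram, and the histograms are projected to 2-D for visualization. The checkpoint name, codebook size, placeholder clip paths, and the use of PCA for the projection are illustrative assumptions.

```python
import numpy as np
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

CHECKPOINT = "facebook/wav2vec2-base"  # assumed checkpoint, not necessarily the paper's
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

def wav2vec_frames(path):
    """Return a (num_frames, hidden_size) array of wav2vec 2.0 frame features."""
    wav, sr = librosa.load(path, sr=16000)
    inputs = feature_extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return hidden.squeeze(0).numpy()

# Placeholder paths standing in for segmented family/infant vocalization clips.
clips = [f"vocalization_{i:04d}.wav" for i in range(2000)]

# 1) Learn a codebook of "audio words" by clustering pooled frame features.
all_frames = np.vstack([wav2vec_frames(p) for p in clips])
codebook = KMeans(n_clusters=64, random_state=0).fit(all_frames)  # assumed codebook size

# 2) Represent each vocalization as a normalized histogram of codeword counts.
def bag_of_audio_words(path):
    words = codebook.predict(wav2vec_frames(path))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

boaw = np.vstack([bag_of_audio_words(p) for p in clips])

# 3) Project the histograms to 2-D to visualize how vocalization interactions cluster.
coords = PCA(n_components=2).fit_transform(boaw)
```

In such a pipeline, each point in the 2-D plot corresponds to one vocalization, so clusters and trajectories of points can be read as patterns of family-infant vocal interaction.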
Related papers
- Large Language Model-Enhanced Interactive Agent for Public Education on Newborn Auricular Deformities [14.396700717621085]
Auricular deformities are quite common in newborns, with potential long-term negative effects on mental health and even hearing.
With the help of the Ernie large language model from Baidu Inc., we build an interactive agent.
It is intelligent enough to detect which type of auricular deformity corresponds to an uploaded image.
To popularize knowledge of auricular deformities, the agent can give parents professional advice about the condition.
arXiv Detail & Related papers (2024-09-04T01:54:58Z) - Qwen2-Audio Technical Report [73.94975476533989]
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio.
Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis.
arXiv Detail & Related papers (2024-07-15T14:38:09Z) - AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech.
We present AV-TranSpeech, the first audio-visual speech-to-speech translation model that does not rely on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Toward a realistic model of speech processing in the brain with
self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z) - Integration of Text and Graph-based Features for Detecting Mental Health
Disorders from Voice [1.5469452301122175]
Two methods are used to enrich voice analysis for depression detection.
Results suggest that integration of text-based voice classification and learning from low level and graph-based voice signal features can improve the detection of mental disorders like depression.
arXiv Detail & Related papers (2022-05-14T08:37:19Z) - Low-dimensional representation of infant and adult vocalization
acoustics [2.1826796927092214]
We use spectral feature extraction and unsupervised machine learning, specifically Uniform Manifold Approximation and Projection (UMAP), to obtain a novel 2-dimensional spatial representation of infant and caregiver vocalizations extracted from day-long home recordings (a minimal sketch of such a projection appears after this list).
For instance, we found that the dispersion of infant vocalization acoustics within the 2-D space over a day increased from 3 to 9 months, and then decreased from 9 to 18 months.
arXiv Detail & Related papers (2022-04-25T17:58:13Z) - Classifying Autism from Crowdsourced Semi-Structured Speech Recordings:
A Machine Learning Approach [0.9945783208680666]
We present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments.
We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features; second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0--a state-of-the-art Transformer-based ASR model.
arXiv Detail & Related papers (2022-01-04T01:31:02Z) - Automatic Analysis of the Emotional Content of Speech in Daylong
Child-Centered Recordings from a Neonatal Intensive Care Unit [3.7373314439051106]
Hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia.
We introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data.
We show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall.
arXiv Detail & Related papers (2021-06-14T11:17:52Z)
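Following up on the UMAP-based low-dimensional representation paper above, here is a minimal sketch, assuming MFCC summary statistics as the spectral features and default UMAP settings, of projecting vocalization clips into a 2-D space. The feature choice, file paths, and hyperparameters are illustrative assumptions, not the cited paper's exact pipeline.

```python
import numpy as np
import librosa
import umap  # from the umap-learn package

def spectral_summary(path):
    """Summarize a clip by the mean and standard deviation of its MFCCs (an assumed feature set)."""
    wav, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20)  # (20, num_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder paths standing in for the many vocalizations segmented from day-long recordings;
# a meaningful embedding needs far more clips than UMAP's default neighborhood size.
clips = [f"vocalization_{i:05d}.wav" for i in range(5000)]
features = np.vstack([spectral_summary(p) for p in clips])

# Unsupervised 2-D embedding; dispersion within this space can then be compared across ages.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
```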
This list is automatically generated from the titles and abstracts of the papers on this site.