Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?
- URL: http://arxiv.org/abs/2305.14035v3
- Date: Thu, 8 Jun 2023 13:32:43 GMT
- Title: Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?
- Authors: Eklavya Sarkar and Mathew Magimai.-Doss
- Abstract summary: Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space.
This paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals.
- Score: 23.041173892976325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) models use only the intrinsic structure of a
given signal, independent of its acoustic domain, to extract essential
information from the input to an embedding space. This implies that the utility
of such representations is not limited to modeling human speech alone. Building
on this understanding, this paper explores the cross-transferability of SSL
neural representations learned from human speech to analyze bio-acoustic
signals. We conduct a caller discrimination analysis and a caller detection
study on Marmoset vocalizations using eleven SSL models pre-trained with
various pretext tasks. The results show that the embedding spaces carry
meaningful caller information and can successfully distinguish the individual
identities of Marmoset callers without fine-tuning. This demonstrates that
representations pre-trained on human speech can be effectively applied to the
bio-acoustics domain, providing valuable insights for future investigations in
this field.
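The abstract's key claim is that frozen SSL embeddings carry caller identity that a downstream analysis can recover "without fine-tuning". A minimal sketch of that setup is below; it is hypothetical and uses synthetic vectors in place of real SSL features (e.g. wav2vec 2.0 layer outputs) and marmoset recordings, since the paper's actual models and data are not reproduced here. The frozen extractor is simulated by fixed per-caller embedding means; only a lightweight linear probe is trained on top.

```python
# Hypothetical caller-discrimination sketch on frozen embeddings.
# Synthetic vectors stand in for utterance-level SSL representations;
# the paper's eleven SSL models and marmoset data are NOT used here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_callers, dim, calls_per_caller = 10, 768, 40

# Simulate one mean embedding per caller plus within-caller variation.
caller_means = rng.normal(0.0, 1.0, size=(n_callers, dim))
X = np.vstack([m + 0.5 * rng.normal(size=(calls_per_caller, dim))
               for m in caller_means])
y = np.repeat(np.arange(n_callers), calls_per_caller)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# "Without fine-tuning": the embedding extractor stays frozen; only a
# linear probe is fit, so any accuracy above chance (0.1 for 10 callers)
# must come from caller information already present in the embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"caller-ID probe accuracy: {acc:.2f}")
```

In the paper's real setting, `X` would be pooled hidden-state features extracted from pre-trained speech models; the linear-probe design isolates what the frozen representation encodes.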
Related papers
- On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis [19.205671029694074]
This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks.
Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
arXiv Detail & Related papers (2024-07-23T12:00:44Z)
- Evaluating Speaker Identity Coding in Self-supervised Models and Humans [0.42303492200814446]
Speaker identity plays a significant role in human communication and is being increasingly used in societal applications.
We show that self-supervised representations from different families are significantly better for speaker identification over acoustic representations.
We also show that such a speaker identification task can be used to better understand the nature of acoustic information representation in different layers of these powerful networks.
arXiv Detail & Related papers (2024-06-14T20:07:21Z)
- Self-Supervised Learning for Few-Shot Bird Sound Classification [10.395255631261458]
Self-supervised learning (SSL) in audio holds significant potential across various domains.
In this study, we demonstrate that SSL is capable of acquiring meaningful representations of bird sounds from audio recordings without the need for annotations.
arXiv Detail & Related papers (2023-12-25T22:33:45Z)
- Neural Sign Actors: A diffusion model for 3D sign language production from text [51.81647203840081]
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities.
This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
arXiv Detail & Related papers (2023-12-05T12:04:34Z)
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Membership Inference Attacks Against Self-supervised Speech Models [62.73937175625953]
Self-supervised learning (SSL) on continuous speech has started gaining attention.
We present the first privacy analysis on several SSL speech models using Membership Inference Attacks (MIA) under black-box access.
arXiv Detail & Related papers (2021-11-09T13:00:24Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Emotion Recognition of the Singing Voice: Toward a Real-Time Analysis Tool for Singers [0.0]
Current computational-emotion research has focused on applying acoustic properties to analyze how emotions are perceived mathematically.
This paper seeks to reflect and expand upon the findings of related research and present a stepping-stone toward this end goal.
arXiv Detail & Related papers (2021-05-01T05:47:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.