SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic
Spaces
- URL: http://arxiv.org/abs/2307.12445v2
- Date: Tue, 30 Jan 2024 23:09:40 GMT
- Title: SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic
Spaces
- Authors: Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote
- Abstract summary: We train a CLIP-based model with the aim to learn shared representations of phonetic and acoustic spaces.
Results show that the proposed model is sensitive to phonetic changes.
We provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications.
- Score: 10.895310812568084
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Numerous examples in the literature have shown that deep learning
models can work well with multimodal data. Recently, CLIP has enabled deep
learning systems to learn shared latent spaces between images and text
descriptions, with outstanding zero- or few-shot results in downstream tasks.
In this paper we explore the same idea proposed by CLIP but applied to the
speech domain, where the phonetic and acoustic spaces usually coexist. We train
a CLIP-based model with the aim of learning shared representations of the
phonetic and acoustic spaces. The results show that the proposed model is
sensitive to phonetic changes, with a 91% score drop when 20% of the phonemes
are replaced at random, while remaining substantially robust against different
kinds of noise, with only a 10% performance drop when the audio is mixed with
75% Gaussian noise. We also provide empirical evidence showing that the
resulting embeddings are useful for a variety of downstream applications, such
as intelligibility evaluation and the ability to leverage rich pre-trained
phonetic embeddings in speech generation tasks. Finally, we discuss potential
applications with
interesting implications for the speech generation and recognition fields.
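As a concrete illustration of the objective described in the abstract, here is a minimal sketch of a CLIP-style contrastive setup between a phoneme encoder and an acoustic encoder: matched (phoneme sequence, spectrogram) pairs are pulled together and mismatched pairs pushed apart with a symmetric InfoNCE loss. The encoder architectures, dimensions and temperature below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style phonetic/acoustic contrastive objective.
# All architectures and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, phoneme_ids):              # (B, T_phonemes) integer phoneme ids
        x = self.embed(phoneme_ids)
        _, h = self.rnn(x)
        return h[-1]                              # (B, dim) utterance-level embedding

class AcousticEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mels):                      # (B, T_frames, n_mels) spectrograms
        x = F.relu(self.conv(mels.transpose(1, 2))).transpose(1, 2)
        _, h = self.rnn(x)
        return h[-1]                              # (B, dim) utterance-level embedding

def clip_loss(phon_emb, acou_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched (phoneme, audio) pairs lie on the diagonal."""
    phon = F.normalize(phon_emb, dim=-1)
    acou = F.normalize(acou_emb, dim=-1)
    logits = phon @ acou.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with dummy data: matched pairs share a batch index.
phon = PhonemeEncoder()(torch.randint(0, 100, (4, 30)))
acou = AcousticEncoder()(torch.randn(4, 120, 80))
loss = clip_loss(phon, acou)
```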
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio [7.808211269929968]
This paper presents Spatial HuBERT, a self-supervised speech representation model.
It learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment.
It learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks.
arXiv Detail & Related papers (2023-10-17T01:31:59Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition (a toy sketch of the query-by-example setup follows this entry).
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
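Below is a toy sketch of the AWE-based query-by-example setup referenced in the entry above: a bidirectional GRU pools a variable-length acoustic word segment into a fixed unit-norm vector, and indexed segments are ranked by cosine similarity to the query. All dimensions and the pooling choice are assumptions, not the paper's models.

```python
# Toy acoustic word embedding (AWE) encoder and query-by-example search.
# Architecture details are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticWordEncoder(nn.Module):
    def __init__(self, n_feats=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_feats, dim, num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, feats):                     # (B, T, n_feats) word segments
        out, _ = self.rnn(feats)
        return F.normalize(out.mean(dim=1), dim=-1)   # (B, 2*dim) unit-norm embeddings

def query_by_example(query_emb, index_embs, top_k=5):
    """Rank indexed segments by cosine similarity to the query (all embeddings unit-norm)."""
    scores = index_embs @ query_emb               # (N,) cosine similarities
    return scores.topk(top_k).indices

# Toy usage: embed 100 indexed segments and one query, then retrieve the top matches.
enc = AcousticWordEncoder()
index = enc(torch.randn(100, 60, 40))
query = enc(torch.randn(1, 55, 40))[0]
hits = query_by_example(query, index)
```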
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations (a minimal sketch of this clustering step follows the entry below).
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
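As a rough sketch of the hidden unit clustering idea above, the snippet below passes raw audio through 1-D convolutions and clusters the resulting frame-level features into a small set of phoneme-like units whose ids can serve as pseudo-labels. Layer shapes, the cluster count and the use of k-means are assumptions, not the paper's exact recipe.

```python
# Rough HUC-style sketch: conv features over raw audio, clustered into
# phoneme-like units. Shapes and the k-means step are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class RawAudioConvEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav):                       # (B, 1, samples) raw audio
        return self.net(wav).transpose(1, 2)      # (B, frames, dim) frame-level features

# Cluster frame-level features into a small number of phoneme-like units.
encoder = RawAudioConvEncoder()
wav = torch.randn(8, 1, 16000)                    # a batch of 1-second dummy waveforms
with torch.no_grad():
    frames = encoder(wav).reshape(-1, 256).numpy()
pseudo_labels = KMeans(n_clusters=50, n_init=10).fit_predict(frames)
```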
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that the gestural scores indeed encode phonological information successfully (a generic sketch of the convolutive factorization idea follows this entry).
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
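The entry above recovers gestures and sparse gestural scores from articulatory data with a convolutive factorization. The snippet below is a heavily simplified, generic illustration of that idea using gradient descent with an L1 sparsity penalty; it drops the non-negativity constraints and is not the paper's neural architecture.

```python
# Generic convolutive-factorization sketch (assumed shapes, not the paper's model):
# articulatory trajectories X are approximated by convolving gestural scores with
# a small bank of gesture kernels; an L1 term keeps the scores sparse.
import torch
import torch.nn.functional as F

T, n_artic = 200, 12                # frames and articulator channels (assumed)
n_gestures, kernel_len = 8, 20      # number of gestures and their duration (assumed)

X = torch.randn(1, n_artic, T)                                    # observed trajectories (dummy data)
scores = torch.zeros(1, n_gestures, T, requires_grad=True)        # gestural scores ("activations")
kernels = torch.randn(n_artic, n_gestures, kernel_len, requires_grad=True)  # gesture shapes

opt = torch.optim.Adam([scores, kernels], lr=1e-2)
for step in range(500):
    # Reconstruct trajectories; trim padding so the output length matches T.
    recon = F.conv1d(scores, kernels, padding=kernel_len - 1)[..., :T]
    loss = F.mse_loss(recon, X) + 1e-3 * scores.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```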
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in perceptual evaluations (a minimal sketch of a vector quantization layer follows this entry).
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
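To illustrate the quantized-latent idea in the entry above, here is a minimal VQ-VAE-style vector quantization layer: each latent frame is snapped to its nearest codebook entry, with a straight-through gradient and the usual codebook/commitment losses. The codebook size, dimension and loss weight are assumptions, not the paper's configuration.

```python
# Minimal VQ-VAE-style vector quantizer; sizes and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, n_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                          # z: (B, T, dim) continuous latents
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)           # (B*T, n_codes)
        ids = dists.argmin(dim=-1)
        q = self.codebook(ids).view_as(z)                         # quantized latents
        # Codebook loss pulls codes toward encoder outputs; commitment loss does the reverse.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                                   # straight-through estimator
        return q, ids.view(z.shape[:-1]), loss

# Toy usage on dummy linguistic latents.
vq = VectorQuantizer()
quantized, codes, vq_loss = vq(torch.randn(4, 50, 128))
```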
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization that assigns individual dynamic stream weights to specific regions (a toy sketch of such region-wise weighting follows this entry).
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
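The data-fusion entry above combines audio and video localization evidence with region-specific dynamic stream weights. Below is a toy sketch of that kind of fusion, where a small network predicts a per-region weight mixing the two modality scores; all shapes, inputs and the weighting network are assumptions for illustration.

```python
# Toy region-wise dynamic stream weighting for audiovisual fusion (all shapes assumed).
import torch
import torch.nn as nn

class RegionStreamWeighting(nn.Module):
    """Predict a per-region weight in [0, 1] and fuse audio/video scores with it."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, audio_feat, video_feat, audio_score, video_score):
        # audio_feat/video_feat: (B, R, feat_dim) per-region reliability features
        # audio_score/video_score: (B, R) per-region speaker-presence scores
        w = self.weight_net(torch.cat([audio_feat, video_feat], dim=-1)).squeeze(-1)  # (B, R)
        return w * audio_score + (1.0 - w) * video_score   # fused per-region score

# Toy usage with 16 spatial regions.
fusion = RegionStreamWeighting()
fused = fusion(torch.randn(2, 16, 32), torch.randn(2, 16, 32),
               torch.rand(2, 16), torch.rand(2, 16))
```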
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)