Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised
Speech Models
- URL: http://arxiv.org/abs/2210.16043v1
- Date: Fri, 28 Oct 2022 10:26:46 GMT
- Title: Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised
Speech Models
- Authors: Ramon Sanabria, Hao Tang, Sharon Goldwater
- Abstract summary: HuBERT representations with mean-pooling rival the state of the art on English AWEs.
Despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53.
- Score: 30.30385903059709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the strong results of self-supervised models on various tasks, there
have been surprisingly few studies exploring self-supervised representations
for acoustic word embeddings (AWE), fixed-dimensional vectors representing
variable-length spoken word segments. In this work, we study several
pre-trained models and pooling methods for constructing AWEs with
self-supervised representations. Owing to the contextualized nature of
self-supervised representations, we hypothesize that simple pooling methods,
such as averaging, might already be useful for constructing AWEs. When
evaluating on a standard word discrimination task, we find that HuBERT
representations with mean-pooling rival the state of the art on English AWEs.
More surprisingly, despite being trained only on English, HuBERT
representations evaluated on Xitsonga, Mandarin, and French consistently
outperform the multilingual model XLSR-53 (as well as Wav2Vec 2.0 trained on
English).
Related papers
- Distilling Monolingual and Crosslingual Word-in-Context Representations [18.87665111304974]
We propose a method that distils representations of word meaning in context from a pre-trained language model in both monolingual and crosslingual settings.
Our method does not require human-annotated corpora nor updates of the parameters of the pre-trained model.
Our method learns to combine the outputs of different hidden layers of the pre-trained model using self-attention.
arXiv Detail & Related papers (2024-09-13T11:10:16Z) - Syllable Discovery and Cross-Lingual Generalization in a Visually
Grounded, Self-Supervised Speech Model [21.286529902957724]
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
arXiv Detail & Related papers (2023-05-19T05:19:04Z) - ProsAudit, a prosodic benchmark for self-supervised speech models [14.198508548718676]
ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning (SSL) speech models.
It consists of two subtasks, their corresponding metrics, and an evaluation dataset.
arXiv Detail & Related papers (2023-02-23T14:30:23Z) - Supervised Acoustic Embeddings And Their Transferability Across
Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Sentence Representation Learning with Generative Objective rather than
Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - SLM: Learning a Discourse Language Representation with Sentence
Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.