Analyzing autoencoder-based acoustic word embeddings
- URL: http://arxiv.org/abs/2004.01647v1
- Date: Fri, 3 Apr 2020 16:11:57 GMT
- Title: Analyzing autoencoder-based acoustic word embeddings
- Authors: Yevgen Matusevych, Herman Kamper, Sharon Goldwater
- Abstract summary: Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
- Score: 37.78342106714364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have introduced methods for learning acoustic word embeddings (AWEs): fixed-size vector representations of words which encode their acoustic features. Despite the widespread use of AWEs in speech processing research, they have only been evaluated quantitatively on their ability to discriminate between whole word tokens. To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs. Here we analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages. We first show that these AWEs preserve some information about words' absolute duration and speaker. At the same time, the representation space of these AWEs is organized such that the distance between words' embeddings increases with those words' phonetic dissimilarity. Finally, the AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access. We argue this is a promising result and encourage further evaluation of AWEs as a potentially useful tool in cognitive science, which could provide a link between speech processing and lexical memory.
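As a concrete illustration of the distance analysis described in the abstract, the sketch below checks whether cosine distances between word embeddings correlate with the phonetic (edit) distance between phone transcriptions. It is a minimal, self-contained probe: the random vectors and toy lexicon are stand-ins, not the paper's seq2seq AWEs, and the generic Levenshtein distance here is only a stand-in for the authors' phonetic dissimilarity measure.
```python
# Toy probe: do AWE distances track phonetic dissimilarity?
# Random vectors stand in for embeddings from a trained encoder.
import numpy as np
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i-1, j] + 1, dp[i, j-1] + 1,
                           dp[i-1, j-1] + (a[i-1] != b[j-1]))
    return dp[len(a), len(b)]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy data: word -> (AWE vector, phone sequence).
rng = np.random.default_rng(0)
words = {
    "cat":  (rng.normal(size=64), ["k", "ae", "t"]),
    "cab":  (rng.normal(size=64), ["k", "ae", "b"]),
    "dog":  (rng.normal(size=64), ["d", "ao", "g"]),
    "fish": (rng.normal(size=64), ["f", "ih", "sh"]),
}

pairs = [(w1, w2) for i, w1 in enumerate(words) for w2 in list(words)[i+1:]]
emb_d = [cosine_distance(words[w1][0], words[w2][0]) for w1, w2 in pairs]
pho_d = [edit_distance(words[w1][1], words[w2][1]) for w1, w2 in pairs]

rho, _ = spearmanr(emb_d, pho_d)
print(f"Spearman correlation (embedding vs. phonetic distance): {rho:.3f}")
```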
Related papers
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive speech emotion recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
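A common recipe for obtaining AWEs from a pre-trained self-supervised encoder, in the spirit of the layer-wise analysis above, is to mean-pool each layer's frame-level features over a word's time span. The sketch below assumes features are already extracted; the random arrays stand in for real model activations, and the 12-layer, 768-dimensional shapes are only illustrative.
```python
# Build one AWE per encoder layer by mean-pooling over a word segment.
import numpy as np

def layerwise_awes(layer_feats, start_frame, end_frame):
    """layer_feats: list of (num_frames, dim) arrays, one per encoder layer.
    Returns one pooled embedding per layer for the given word span."""
    return [feats[start_frame:end_frame].mean(axis=0) for feats in layer_feats]

rng = np.random.default_rng(1)
num_layers, num_frames, dim = 12, 200, 768   # e.g., a wav2vec2-style stack
feats = [rng.normal(size=(num_frames, dim)) for _ in range(num_layers)]

awes = layerwise_awes(feats, start_frame=50, end_frame=80)  # one word segment
print(len(awes), awes[0].shape)  # 12 layer-wise AWEs of size 768
```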
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
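Query-by-example search with AWEs typically reduces to nearest-neighbor retrieval: embed the spoken query and all indexed segments with the same encoder, then rank candidates by cosine similarity. A minimal sketch, with random vectors standing in for RNN-encoder outputs:
```python
# Rank indexed speech segments against a spoken query in AWE space.
import numpy as np

def rank_by_similarity(query_emb, candidate_embs):
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per candidate
    return np.argsort(-scores), scores   # best match first

rng = np.random.default_rng(2)
query = rng.normal(size=128)               # AWE of the spoken query
candidates = rng.normal(size=(1000, 128))  # AWEs of indexed segments

order, scores = rank_by_similarity(query, candidates)
print("top-5 segment indices:", order[:5])
```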
- Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR [8.832255053182283]
We present a novel approach to predict the user's intent (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at the subword-token level.
We show that our approach is highly accurate, correctly mitigating 93.3% of unintended user audio from invoking the smart assistant at a 99% true positive rate.
arXiv Detail & Related papers (2022-10-21T17:45:00Z)
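The operating point quoted above (93.3% mitigation at a 99% true positive rate) can be reproduced mechanically once per-utterance intent scores exist: fix the threshold that retains 99% of intended audio, then measure the rejected fraction of unintended audio. A toy sketch with simulated scores, not the paper's model:
```python
# Operating-point arithmetic for intent detection at a fixed TPR.
import numpy as np

rng = np.random.default_rng(3)
intended = rng.normal(loc=2.0, scale=1.0, size=5000)     # device-directed
unintended = rng.normal(loc=-2.0, scale=1.0, size=5000)  # background speech

threshold = np.percentile(intended, 1)        # keep 99% of intended audio
mitigation = (unintended < threshold).mean()  # fraction correctly rejected
print(f"mitigated {mitigation:.1%} of unintended audio at 99% TPR")
```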
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the discriminability of the embedding space.
arXiv Detail & Related papers (2022-09-14T13:33:04Z)
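One plausible form of such a multi-task objective, sketched under the assumption of a triplet word-discrimination loss plus an auxiliary word-label classifier injecting lexical knowledge; the weighting `alpha` and the margin are illustrative, not the paper's values:
```python
# Multi-task loss: acoustic word discrimination + lexical classification.
import torch
import torch.nn.functional as F

dim, vocab = 128, 500
classifier = torch.nn.Linear(dim, vocab)  # auxiliary word-label head

def multitask_loss(anchor, positive, negative, word_ids, alpha=0.5):
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=0.4)
    lexical = F.cross_entropy(classifier(anchor), word_ids)
    return triplet + alpha * lexical

# Toy batch of AWEs (normally produced by an acoustic encoder).
a, p, n = (torch.randn(32, dim) for _ in range(3))
word_ids = torch.randint(0, vocab, (32,))
print(multitask_loss(a, p, n, word_ids))
```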
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
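Of the three categories named above, the contrastive family is easy to illustrate with an InfoNCE-style loss, where each anchor should score highest against its own paired view. A toy sketch with random tensors, not a real speech encoder:
```python
# InfoNCE: matching pairs lie on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature     # (batch, batch) similarities
    targets = torch.arange(a.size(0))  # each anchor's positive is its own row
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(64, 256), torch.randn(64, 256)
print(info_nce(z1, z2))
```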
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
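The paper's own access mechanism is not reproduced here; as a hedged illustration of the general idea, a word/phrase memory can bias an ASR decoder shallow-fusion style, adding a score bonus to hypotheses that contain a stored phrase. The memory contents and bonus value below are hypothetical:
```python
# Shallow-fusion-style biasing from a small word/phrase memory.
import math

memory = {"covid", "zoom call", "wordle"}   # hypothetical new-word memory

def biased_score(hypothesis: str, base_log_prob: float, bonus: float = 2.0):
    """Add a bonus to the decoder's log-probability if the hypothesis
    contains any phrase currently stored in the memory."""
    hit = any(phrase in hypothesis.lower() for phrase in memory)
    return base_log_prob + (bonus if hit else 0.0)

print(biased_score("schedule a zoom call", math.log(0.01)))
print(biased_score("schedule a phone call", math.log(0.02)))
```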
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel, fully unsupervised learning method for discovering speech units, acquiring words and phonemes directly from speech signals.
arXiv Detail & Related papers (2022-01-18T07:31:59Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
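Whole-word segmental decoding can be sketched as a dynamic program over segment boundaries, where each candidate segment is embedded (a stand-in AWE via mean-pooling) and scored against written-word embeddings. Random arrays replace the pre-trained representations discussed above:
```python
# Best-scoring segmentation of an utterance via dynamic programming.
import numpy as np

rng = np.random.default_rng(4)
num_frames, dim, vocab = 30, 64, 100
frames = rng.normal(size=(num_frames, dim))
word_embs = rng.normal(size=(vocab, dim))

def segment_score(start, end):
    """Embed frames[start:end] by mean-pooling (a stand-in AWE) and
    return the best-matching word's dot-product score."""
    awe = frames[start:end].mean(axis=0)
    return float(np.max(word_embs @ awe))

best = np.full(num_frames + 1, -np.inf)
best[0] = 0.0
for end in range(1, num_frames + 1):
    for start in range(max(0, end - 10), end):   # max segment length 10
        best[end] = max(best[end], best[start] + segment_score(start, end))
print(f"best segmentation score: {best[num_frames]:.2f}")
```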
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.