Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings
- URL: http://arxiv.org/abs/2209.06633v1
- Date: Wed, 14 Sep 2022 13:33:04 GMT
- Title: Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings
- Authors: Badr M. Abdullah, Bernd Möbius, Dietrich Klakow
- Abstract summary: We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
- Score: 19.195728241989702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models of acoustic word embeddings (AWEs) learn to map variable-length spoken
word segments onto fixed-dimensionality vector representations such that
different acoustic exemplars of the same word are projected nearby in the
embedding space. In addition to their speech technology applications, AWE
models have been shown to predict human performance on a variety of auditory
lexical processing tasks. Current AWE models are based on neural networks and
trained in a bottom-up approach that integrates acoustic cues to build up a
word representation given an acoustic or symbolic supervision signal.
Therefore, these models do not leverage or capture high-level lexical knowledge
during the learning process. In this paper, we propose a multi-task learning model that incorporates
top-down lexical knowledge into the training procedure of AWEs. Our model
learns a mapping between the acoustic input and a lexical representation that
encodes high-level information such as word semantics in addition to bottom-up
form-based supervision. We experiment with three languages and demonstrate that
incorporating lexical knowledge improves the embedding space discriminability
and encourages the model to better separate lexical categories.
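As a concrete illustration of the idea, here is a minimal sketch (not the authors' code) of such a multi-task objective: a BiGRU encoder produces the AWE, a triplet loss supplies the bottom-up form-based supervision, and a projection onto pretrained word vectors supplies the top-down lexical supervision. The architecture, dimensions, and loss weighting are illustrative assumptions.

```python
# Minimal sketch (illustrative): multi-task AWE training that combines
# bottom-up form supervision with top-down lexical supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AWEEncoder(nn.Module):
    """Maps a variable-length acoustic sequence to a fixed-size embedding."""
    def __init__(self, feat_dim=39, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)   # pooled states -> AWE
        self.to_lex = nn.Linear(emb_dim, 300)        # AWE -> word-embedding space

    def forward(self, x):
        out, _ = self.rnn(x)                         # (B, T, 2*hidden)
        awe = self.proj(out.mean(dim=1))             # mean-pool over time
        return awe, self.to_lex(awe)

def multitask_loss(anchor, pos, neg, lex_pred, lex_target, margin=0.4, alpha=0.5):
    # Form objective: same-word exemplars closer than different-word ones.
    form = F.triplet_margin_with_distance_loss(
        anchor, pos, neg,
        distance_function=lambda a, b: 1 - F.cosine_similarity(a, b),
        margin=margin)
    # Lexical objective: predicted vector close to a pretrained word embedding.
    lex = (1 - F.cosine_similarity(lex_pred, lex_target)).mean()
    return form + alpha * lex

# Toy usage with random tensors standing in for acoustic word segments.
enc = AWEEncoder()
xa, xp, xn = (torch.randn(8, 50, 39) for _ in range(3))
a, lex_pred = enc(xa)
p, _ = enc(xp)
n, _ = enc(xn)
lex_target = torch.randn(8, 300)  # stand-in for pretrained word vectors
loss = multitask_loss(a, p, n, lex_pred, lex_target)
loss.backward()
```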
Related papers
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
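A minimal sketch of the acoustic grounding idea, under illustrative assumptions (not this paper's architecture): a character-level encoder embeds written words, an acoustic encoder embeds spoken segments, and an InfoNCE-style loss pulls matched pairs together in a shared space.

```python
# Minimal sketch (illustrative): jointly embed spoken segments (AWE) and
# written words (AGWE) so that matched pairs land close together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEnc(nn.Module):
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)
    def forward(self, x):
        out, _ = self.rnn(x)                              # (B, T, 2*emb)
        return F.normalize(self.proj(out.mean(dim=1)), dim=-1)

class WrittenEnc(nn.Module):
    def __init__(self, n_chars=30, emb_dim=128):
        super().__init__()
        self.chars = nn.Embedding(n_chars, 64)
        self.rnn = nn.GRU(64, emb_dim, batch_first=True)
    def forward(self, c):
        _, h = self.rnn(self.chars(c))                    # h: (1, B, emb)
        return F.normalize(h.squeeze(0), dim=-1)

speech = torch.randn(8, 50, 39)             # acoustic segments (stand-in)
spellings = torch.randint(0, 30, (8, 10))   # character ids of the same words
a, w = AcousticEnc()(speech), WrittenEnc()(spellings)
# InfoNCE-style matching: each segment should retrieve its own spelling.
logits = a @ w.t() / 0.07
loss = F.cross_entropy(logits, torch.arange(8))
loss.backward()
```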
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
- Analyzing the Representational Geometry of Acoustic Word Embeddings [22.677210029168588]
Acoustic word embeddings (AWEs) are vector representations such that different acoustic exemplars of the same word are projected nearby.
This paper takes a closer analytical look at AWEs learned from English speech and studies how the choice of the learning objective and the architecture shapes their representational profile.
arXiv Detail & Related papers (2023-01-08T10:22:50Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
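A minimal sketch of the masked-prediction idea, assuming all three modalities have already been mapped to a shared discrete vocabulary (the tokenizers are out of scope here, and all sizes are illustrative, not VATLM's actual configuration):

```python
# Minimal sketch (illustrative): masked prediction over a unified token
# sequence shared by the audio, visual, and text modalities.
import torch
import torch.nn as nn

vocab, d_model, mask_id = 1000, 256, 0
embed = nn.Embedding(vocab, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(1, vocab, (8, 32))   # unified tokens from any modality
mask = torch.rand(tokens.shape) < 0.15      # mask ~15% of positions
inputs = tokens.masked_fill(mask, mask_id)
logits = head(backbone(embed(inputs)))
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask])             # predict only the masked positions
loss.backward()
```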
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
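A minimal sketch of the intermediate-layer supervision idea, with the target-prediction details simplified (the layer choice, head design, and pseudo-labels are illustrative assumptions, not ILS-SSL's exact setup):

```python
# Minimal sketch (illustrative): add an auxiliary prediction loss on an
# intermediate layer in addition to the usual loss on the final layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_targets = 256, 500
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(6))
mid_head = nn.Linear(d_model, n_targets)         # supervises layer 3
top_head = nn.Linear(d_model, n_targets)         # supervises the final layer

x = torch.randn(8, 100, d_model)                 # masked speech features (stand-in)
targets = torch.randint(0, n_targets, (8, 100))  # e.g. clustered-unit pseudo-labels

hidden, mid_out = x, None
for i, layer in enumerate(layers):
    hidden = layer(hidden)
    if i == 2:                                   # intermediate supervision point
        mid_out = hidden
loss = (F.cross_entropy(top_head(hidden).transpose(1, 2), targets)
        + F.cross_entropy(mid_head(mid_out).transpose(1, 2), targets))
loss.backward()
```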
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
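A minimal sketch of the autoencoder-based approach, under illustrative assumptions (not this paper's exact model): a sequence-to-sequence GRU autoencoder whose fixed-size bottleneck serves as the acoustic word embedding.

```python
# Minimal sketch (illustrative): a sequence-to-sequence autoencoder whose
# fixed-size bottleneck state is used as the acoustic word embedding.
import torch
import torch.nn as nn

class Seq2SeqAWE(nn.Module):
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, feat_dim)

    def forward(self, x):
        _, h = self.encoder(x)               # h: (1, B, emb_dim) = the AWE
        dec_in = torch.zeros_like(x)         # zero inputs; no teacher forcing
        dec_out, _ = self.decoder(dec_in, h) # decoder conditioned on the AWE
        return self.out(dec_out), h.squeeze(0)

model = Seq2SeqAWE()
x = torch.randn(8, 50, 39)                   # acoustic word segments (stand-in)
recon, awe = model(x)
loss = nn.functional.mse_loss(recon, x)      # reconstruction objective
loss.backward()
```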
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.