Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations
with Subwords
- URL: http://arxiv.org/abs/2102.02270v1
- Date: Wed, 3 Feb 2021 20:03:50 GMT
- Title: Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations
with Subwords
- Authors: Prashanth Gurunath Shivakumar, Panayiotis Georgiou, Shrikanth
Narayanan
- Abstract summary: Confusion2vec is a word vector representation which encodes ambiguities present in human spoken language.
We show the subword encoding helps better represent the acoustic perceptual ambiguities in human spoken language.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word vector representations enable machines to encode human language for
spoken language understanding and processing. Confusion2vec, motivated from
human speech production and perception, is a word vector representation which
encodes ambiguities present in human spoken language in addition to semantics
and syntactic information. Confusion2vec provides a robust spoken language
representation by considering inherent human language ambiguities. In this
paper, we propose a novel word vector space estimation by unsupervised learning
on lattices output by an automatic speech recognition (ASR) system. We encode
each word in confusion2vec vector space by its constituent subword character
n-grams. We show the subword encoding helps better represent the acoustic
perceptual ambiguities in human spoken language via information modeled on
lattice structured ASR output. The usefulness of the proposed Confusion2vec
representation is evaluated using semantic, syntactic and acoustic analogy and
word similarity tasks. We also show the benefits of subword modeling for
acoustic ambiguity representation on the task of spoken language intent
detection. The results significantly outperform existing word vector
representations when evaluated on erroneous ASR outputs. We demonstrate that
Confusion2vec subword modeling eliminates the need for retraining/adapting the
natural language understanding models on ASR transcripts.
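The subword scheme described above is fastText-style: each word vector is composed from the vectors of its constituent character n-grams, which in Confusion2vec 2.0 are learned on lattice-structured ASR output. Below is a minimal, illustrative Python sketch of that composition step; the helper names, the 3-6 character n-gram range, and the randomly initialized toy n-gram table are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of fastText-style subword composition, assuming boundary-marked
# character n-grams (3-6 characters). The toy random n-gram table stands in for
# vectors that would actually be learned with the confusion2vec objective on
# lattice-structured ASR output.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Return boundary-marked character n-grams of a word."""
    marked = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

rng = np.random.default_rng(0)
dim = 50
ngram_vecs = {}  # hypothetical n-gram embedding table

def compose_word_vector(word):
    """Word vector = mean of its subword n-gram vectors."""
    grams = char_ngrams(word)
    for g in grams:
        if g not in ngram_vecs:
            ngram_vecs[g] = rng.normal(size=dim)
    return np.mean([ngram_vecs[g] for g in grams], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that share subwords (e.g. morphological variants, or orthographically
# overlapping words an ASR system confuses) share n-gram vectors, so they get
# related representations even if a surface form never occurred in training.
v_played, v_playing, v_table = map(compose_word_vector, ["played", "playing", "table"])
print("played vs playing:", round(cosine(v_played, v_playing), 3))
print("played vs table  :", round(cosine(v_played, v_table), 3))
```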
Related papers
- Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations [18.251845041785906]
We propose a framework to learn semantics from raw audio signals using two types of representations.
We introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions.
For the language model, we adopt a dual-channel architecture to incorporate both types of representation.
arXiv Detail & Related papers (2024-02-02T10:39:58Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed as Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than that of comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves performance similar to supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR [8.832255053182283]
We present a novel approach to predict the user's intent (whether or not the user is speaking to the device) directly from acoustic and textual information encoded at subword tokens.
We show that our approach is highly accurate, correctly preventing 93.3% of unintended user audio from invoking the smart assistant at a 99% true positive rate.
arXiv Detail & Related papers (2022-10-21T17:45:00Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning [2.28438857884398]
We present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning spoken-word representations.
STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken-word.
Latent representations produced by our model were able to predict the target phonetic sequences with an accuracy of 89.47%.
arXiv Detail & Related papers (2020-11-23T13:29:16Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.