Audio-to-Intent Using Acoustic-Textual Subword Representations from
End-to-End ASR
- URL: http://arxiv.org/abs/2210.12134v1
- Date: Fri, 21 Oct 2022 17:45:00 GMT
- Title: Audio-to-Intent Using Acoustic-Textual Subword Representations from
End-to-End ASR
- Authors: Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan
Niu, Ahmed Tewfik
- Abstract summary: We present a novel approach to predict the user's intent (the user speaking to the device or not) directly from acoustic and textual information encoded at subword tokens.
We show that our approach is highly accurate, correctly mitigating 93.3% of unintended user audio from invoking the smart assistant at a 99% true positive rate.
- Score: 8.832255053182283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate prediction of the user intent to interact with a voice assistant
(VA) on a device (e.g. on the phone) is critical for achieving naturalistic,
engaging, and privacy-centric interactions with the VA. To this end, we present
a novel approach to predict the user's intent (the user speaking to the device
or not) directly from acoustic and textual information encoded at subword
tokens which are obtained via an end-to-end ASR model. Modeling the subword
tokens directly, compared to modeling phonemes and/or full words, has at least
two advantages: (i) it provides a unique vocabulary representation in which each
token carries semantic meaning, in contrast to phoneme-level representations;
and (ii) each subword token has a reusable "sub"-word acoustic pattern (that can
be used to construct multiple full words), resulting in a vocabulary space much
smaller than that of full words. To learn the subword
representations for the audio-to-intent classification, we extract: (i)
acoustic information from an E2E-ASR model, which provides frame-level CTC
posterior probabilities for the subword tokens, and (ii) textual information
from a pre-trained continuous bag-of-words model capturing the semantic meaning
of the subword tokens. The key to our approach is the way it combines acoustic
subword-level posteriors with text information using positional encoding to
account for multiple ASR hypotheses simultaneously. We show that our approach
provides more robust and richer representations for audio-to-intent
classification, and is highly accurate, correctly mitigating 93.3% of unintended
user audio from invoking the smart assistant at a 99% true positive rate.
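As a rough illustration of the fusion described in the abstract, the sketch below combines frame-level CTC subword posteriors (acoustic branch) with pretrained CBOW subword embeddings of the N-best ASR hypotheses plus sinusoidal positional encoding (textual branch), and feeds the fused vector to a binary intent classifier. This is a minimal PyTorch sketch, not the authors' implementation: the module names, pooling choices, and classifier head are assumptions; only the three ingredients it wires together come from the abstract.

```python
# Hypothetical sketch of audio-to-intent classification from acoustic-textual
# subword representations. Names and architecture details are illustrative.
import math
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(max_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (max_len, dim)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class AudioToIntent(nn.Module):
    def __init__(self, vocab_size: int, cbow_dim: int, hidden: int = 128, max_len: int = 64):
        super().__init__()
        # Pretrained CBOW embeddings for subword tokens would be loaded here;
        # a randomly initialized table stands in for them in this sketch.
        self.cbow = nn.Embedding(vocab_size, cbow_dim)
        self.register_buffer("pos_enc", sinusoidal_positional_encoding(max_len, cbow_dim))
        self.acoustic_proj = nn.Linear(vocab_size, hidden)  # frame-level CTC posteriors -> hidden
        self.textual_proj = nn.Linear(cbow_dim, hidden)     # CBOW embedding + position -> hidden
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, ctc_posteriors: torch.Tensor, hypotheses: list) -> torch.Tensor:
        """ctc_posteriors: (T, vocab_size) frame-level subword posteriors from the E2E ASR model.
        hypotheses: list of 1-D LongTensors, the top-N ASR subword token sequences."""
        # Acoustic branch: mean-pool projected frame-level posteriors over time.
        acoustic = self.acoustic_proj(ctc_posteriors).mean(dim=0)
        # Textual branch: embed each hypothesis with CBOW vectors + positional encoding,
        # mean-pool over tokens, then average over the N-best hypotheses.
        hyp_vecs = []
        for hyp in hypotheses:
            emb = self.cbow(hyp) + self.pos_enc[: hyp.size(0)]
            hyp_vecs.append(self.textual_proj(emb).mean(dim=0))
        textual = torch.stack(hyp_vecs).mean(dim=0)
        # Fuse both views and score intent (speaking to the device vs. not).
        return self.classifier(torch.cat([acoustic, textual], dim=-1))


if __name__ == "__main__":
    model = AudioToIntent(vocab_size=500, cbow_dim=64)
    posteriors = torch.softmax(torch.randn(120, 500), dim=-1)          # 120 frames
    nbest = [torch.randint(0, 500, (12,)), torch.randint(0, 500, (10,))]
    print(torch.sigmoid(model(posteriors, nbest)))                     # intent probability
```

The mean-pooling and concatenation here stand in for whatever fusion the paper actually uses; the key idea preserved is that positional encoding lets multiple ASR hypotheses contribute to a single textual representation alongside the acoustic CTC posteriors.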
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition [46.675712485821805]
Subword units are commonly used for end-to-end automatic speech recognition (ASR)
We propose an acoustic data-driven subword modeling approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline.
arXiv Detail & Related papers (2021-04-19T07:54:15Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning [2.28438857884398]
We present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning spoken-word representations.
STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken-word.
Latent representations produced by our model were able to predict the target phonetic sequences with an accuracy of 89.47%.
arXiv Detail & Related papers (2020-11-23T13:29:16Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)