Generalized zero-shot audio-to-intent classification
- URL: http://arxiv.org/abs/2311.02482v1
- Date: Sat, 4 Nov 2023 18:55:08 GMT
- Title: Generalized zero-shot audio-to-intent classification
- Authors: Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan
Bodapati, Srikanth Ronanki
- Abstract summary: We propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent.
We leverage a neural audio synthesizer to create audio embeddings for sample text utterances.
Our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents by 2.75% and 18.2% for the SLURP and internal goal-oriented dialog datasets, respectively.
- Score: 7.76114116227644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding systems using audio-only data are gaining
popularity, yet their ability to handle unseen intents remains limited. In this
study, we propose a generalized zero-shot audio-to-intent classification
framework with only a few sample text sentences per intent. To achieve this, we
first train a supervised audio-to-intent classifier by making use of a
self-supervised pre-trained model. We then leverage a neural audio synthesizer
to create audio embeddings for sample text utterances and perform generalized
zero-shot classification on unseen intents using cosine similarity. We also
propose a multimodal training strategy that incorporates lexical information
into the audio representation to improve zero-shot performance. Our multimodal
training approach improves the accuracy of zero-shot intent classification on
unseen intents of SLURP by 2.75% and 18.2% for the SLURP and internal
goal-oriented dialog datasets, respectively, compared to audio-only training.
Related papers
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-ZS (Listenable Maps for Audio Classifiers), the first such method for the Zero-Shot context.
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Learning Audio Concepts from Counterfactual Natural Language [34.118579918018725]
This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy in the open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z)
- Weakly-supervised Automated Audio Captioning via text only training [1.504795651143257]
We propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model.
We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to 83% compared to fully supervised approaches.
arXiv Detail & Related papers (2023-09-21T16:40:46Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR [13.726142328715897]
We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language.
Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language.
arXiv Detail & Related papers (2021-11-12T16:16:46Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.