Effectiveness of Text, Acoustic, and Lattice-based representations in
Spoken Language Understanding tasks
- URL: http://arxiv.org/abs/2212.08489v1
- Date: Fri, 16 Dec 2022 14:01:42 GMT
- Title: Effectiveness of Text, Acoustic, and Lattice-based representations in
Spoken Language Understanding tasks
- Authors: Esaú Villatoro-Tello, Srikanth Madikeri, Juan Zuluaga-Gomez, Bidisha
Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlicek, Alexei V.
Ivanov, Aravind Ganapathiraju
- Abstract summary: We benchmark three types of systems to perform the intent detection task.
We evaluate the systems on the publicly available SLURP spoken language resource corpus.
- Score: 5.66060067322059
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we perform an exhaustive evaluation of different
representations to address the intent classification problem in a Spoken
Language Understanding (SLU) setup. We benchmark three types of systems to
perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a
novel 3) multimodal approach. Our work provides a comprehensive analysis of
what could be the achievable performance of different state-of-the-art SLU
systems under different circumstances, e.g., automatically- vs.
manually-generated transcripts. We evaluate the systems on the publicly
available SLURP spoken language resource corpus. Our results indicate that
using richer forms of Automatic Speech Recognition (ASR) outputs allows SLU
systems to improve in comparison to the 1-best setup (4% relative improvement).
However, crossmodal approaches, i.e., learning from acoustic and text
embeddings, obtain performance similar to the oracle setup, with a relative
improvement of 18% over the 1-best configuration. Thus, crossmodal
architectures represent a good alternative for overcoming the limitations of
working with purely automatically generated textual data.
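As a concrete illustration of the crossmodal idea described in the abstract, the sketch below fuses an utterance-level acoustic embedding with a text embedding by concatenation and classifies the intent with a small feed-forward head. The embedding dimensions, the fusion-by-concatenation choice, the head architecture, and the num_intents value are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: late fusion of acoustic and text embeddings for intent
# classification. All dimensions and layer sizes are assumptions.
import torch
import torch.nn as nn

class CrossmodalIntentClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=768, hidden_dim=256,
                 num_intents=60):  # 60 is a placeholder, not SLURP's exact label count
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_intents),
        )

    def forward(self, acoustic_emb, text_emb):
        # Late fusion: concatenate the two utterance-level embeddings.
        fused = torch.cat([acoustic_emb, text_emb], dim=-1)
        return self.head(fused)

# Toy usage with random vectors standing in for, e.g., wav2vec 2.0 and BERT
# sentence embeddings.
model = CrossmodalIntentClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 60])
```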
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
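A minimal sketch of this lattice-prompting idea: rather than prompting an LLM with only the 1-best transcript, include several competing hypotheses from the lattice or word confusion network so the model can reason over the ambiguity. The prompt wording, hypothesis format, and intent labels below are assumptions, not the paper's exact protocol.

```python
# Hedged sketch: build an intent-classification prompt from multiple ASR
# hypotheses instead of the single top transcript.
def build_icl_prompt(hypotheses, intents):
    lines = ["Possible transcripts of one spoken utterance (best first):"]
    lines += [f"{i + 1}. {h}" for i, h in enumerate(hypotheses)]
    lines.append(f"Choose the speaker's intent from: {', '.join(intents)}.")
    lines.append("Intent:")
    return "\n".join(lines)

prompt = build_icl_prompt(
    ["set an alarm for ate a.m.", "set an alarm for eight a.m."],
    ["alarm_set", "alarm_query", "calendar_set"],
)
print(prompt)  # send this to any instruction-following LLM
```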
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently become more promising.
This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
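A minimal sketch of confidence-gated fusion under loose assumptions: a scalar confidence estimated from the text representation decides how much weight the transcript embedding gets versus the audio embedding. The gating network and the convex combination are illustrative stand-ins for the paper's modality-confidence mechanism.

```python
# Hedged sketch: fuse audio and text embeddings weighted by an estimated
# confidence in the ASR hypothesis. Dimensions are assumptions.
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    def __init__(self, dim=768, num_intents=60):
        super().__init__()
        self.confidence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_intents)

    def forward(self, audio_emb, text_emb):
        # c near 1 -> trust the (possibly errorful) transcript embedding.
        c = self.confidence(text_emb)
        fused = c * text_emb + (1.0 - c) * audio_emb
        return self.classifier(fused)

model = ConfidenceGatedFusion()
print(model(torch.randn(2, 768), torch.randn(2, 768)).shape)  # torch.Size([2, 60])
```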
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models [69.35569554213679]
In Spoken Language Understanding (SLU), the task is to extract important information from audio commands.
This paper presents a simple method for embedding intents and entities into Finite State Transducers.
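A minimal sketch of the idea under loose assumptions: compile command templates into a tiny acceptor, here a plain trie with a wildcard arc rather than a real FST toolkit such as OpenFst, that maps a speech-to-text token sequence to an intent and captures one slot as an entity. The templates and labels are invented for illustration.

```python
# Hedged sketch: a trie standing in for a finite state transducer that maps
# token sequences to intents, with "<slot>" acting as a wildcard arc.
def compile_grammar(templates):
    trie = {}
    for tokens, intent in templates:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<final>"] = intent
    return trie

def parse(trie, tokens):
    node, entities = trie, []
    for tok in tokens:
        if tok in node:
            node = node[tok]
        elif "<slot>" in node:        # wildcard arc: capture the token as an entity
            node = node["<slot>"]
            entities.append(tok)
        else:
            return None, []           # utterance not accepted by the grammar
    return node.get("<final>"), entities

grammar = compile_grammar([
    (["turn", "on", "the", "<slot>"], "device_on"),
    (["turn", "off", "the", "<slot>"], "device_off"),
])
print(parse(grammar, "turn on the lights".split()))  # ('device_on', ['lights'])
```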
arXiv Detail & Related papers (2022-06-29T12:49:53Z)
- On Building Spoken Language Understanding Systems for Low Resourced Languages [1.2183405753834562]
We present a series of experiments to explore extremely low-resourced settings.
We perform intent classification with systems trained on as few as one data point per intent and with only one speaker in the dataset.
We find that using phonetic transcriptions to build intent classification systems in such low-resourced settings performs significantly better than using speech features.
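A minimal sketch of the one-example-per-intent setting: store one phonetic transcription per intent and classify a new utterance by Levenshtein nearest neighbour over phone sequences. The phone strings are invented for illustration; the paper's exact classifier may differ.

```python
# Hedged sketch: nearest-neighbour intent classification over phonetic
# transcriptions with one stored exemplar per intent.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

prototypes = {  # one phone-sequence exemplar per intent (invented)
    "lights_on": "t er n aa n dh ax l ay t s".split(),
    "play_music": "p l ey s ah m m y uw z ih k".split(),
}

def classify(phones):
    return min(prototypes, key=lambda k: edit_distance(phones, prototypes[k]))

print(classify("t er n aa n l ay t".split()))  # lights_on
```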
arXiv Detail & Related papers (2022-05-25T14:44:51Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet under-explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios.
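A minimal sketch of this language-agnostic pipeline: utterance embeddings from a pre-trained acoustic model feed a lightweight classifier, so no language-specific ASR is required. Random vectors stand in for real embeddings here, and the 768-dimensional size, classifier choice, and per-intent counts are assumptions.

```python
# Hedged sketch: simulate high- vs low-resource intent classification on top
# of fixed acoustic embeddings (random stand-ins for real model outputs).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_intents, dim = 6, 768

for n_per_intent in (100, 2):  # "high" vs "low" resource
    X = rng.normal(size=(num_intents * n_per_intent, dim))  # embeddings
    y = np.repeat(np.arange(num_intents), n_per_intent)     # intent labels
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(n_per_intent, "examples/intent -> train acc:", clf.score(X, y))
```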
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- End-to-End Spoken Language Understanding for Generalized Voice Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
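A minimal sketch of the joint fine-tuning objective described above: a transcription loss (CTC here, as one common choice) combined with a semantic classification loss computed on the same batch. The 0.5 weighting and the random tensors are placeholder assumptions.

```python
# Hedged sketch: joint ASR + intent loss, as would be back-propagated through
# a shared encoder. Shapes and the loss weighting are assumptions.
import torch
import torch.nn.functional as F

batch, frames, vocab, num_intents = 2, 50, 30, 10
asr_logits = torch.randn(frames, batch, vocab).log_softmax(-1)  # (T, N, V) for CTC
targets = torch.randint(1, vocab, (batch, 12))                  # token ids (0 = blank)
intent_logits = torch.randn(batch, num_intents)
intent_labels = torch.randint(0, num_intents, (batch,))

ctc = F.ctc_loss(asr_logits, targets,
                 input_lengths=torch.full((batch,), frames),
                 target_lengths=torch.full((batch,), 12))
ce = F.cross_entropy(intent_logits, intent_labels)
loss = ctc + 0.5 * ce  # joint transcription + semantic classification loss
print(float(loss))
```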
arXiv Detail & Related papers (2021-06-16T17:56:47Z)