Effectiveness of Text, Acoustic, and Lattice-based representations in
Spoken Language Understanding tasks
- URL: http://arxiv.org/abs/2212.08489v1
- Date: Fri, 16 Dec 2022 14:01:42 GMT
- Title: Effectiveness of Text, Acoustic, and Lattice-based representations in
Spoken Language Understanding tasks
- Authors: Esaú Villatoro-Tello, Srikanth Madikeri, Juan Zuluaga-Gomez, Bidisha
Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlicek, Alexei V.
Ivanov, Aravind Ganapathiraju
- Abstract summary: We benchmark three types of systems to perform the intent detection task.
We evaluate the systems on the publicly available SLURP spoken language resource corpus.
- Score: 5.66060067322059
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we perform an exhaustive evaluation of different
representations to address the intent classification problem in a Spoken
Language Understanding (SLU) setup. We benchmark three types of systems to
perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a
novel 3) multimodal approach. Our work provides a comprehensive analysis of
what could be the achievable performance of different state-of-the-art SLU
systems under different circumstances, e.g., automatically- vs.
manually-generated transcripts. We evaluate the systems on the publicly
available SLURP spoken language resource corpus. Our results indicate that
using richer forms of Automatic Speech Recognition (ASR) outputs allows SLU
systems to improve in comparison to the 1-best setup (4% relative improvement).
However, crossmodal approaches, i.e., learning from acoustic and text
embeddings, obtain performance similar to the oracle setup, with a relative
improvement of 18% over the 1-best configuration. Thus, crossmodal
architectures represent a good alternative for overcoming the limitations of
working with purely automatically generated textual data.
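As a concrete illustration of the crossmodal idea described in the abstract, the sketch below fuses an utterance-level acoustic embedding with a text embedding by concatenation and classifies the intent with a small feed-forward head. The embedding dimensions, the fusion-by-concatenation choice, the head architecture, and the num_intents value are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: late fusion of acoustic and text embeddings for intent
# classification. All dimensions and layer sizes are assumptions.
import torch
import torch.nn as nn

class CrossmodalIntentClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=768, hidden_dim=256,
                 num_intents=60):  # 60 is a placeholder, not SLURP's exact label count
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_intents),
        )

    def forward(self, acoustic_emb, text_emb):
        # Late fusion: concatenate the two utterance-level embeddings.
        fused = torch.cat([acoustic_emb, text_emb], dim=-1)
        return self.head(fused)

# Toy usage with random vectors standing in for, e.g., wav2vec 2.0 and BERT
# sentence embeddings.
model = CrossmodalIntentClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 60])
```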
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
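A minimal sketch of this lattice-prompting idea: rather than prompting an LLM with only the 1-best transcript, include several competing hypotheses from the lattice or word confusion network so the model can reason over the ambiguity. The prompt wording, hypothesis format, and intent labels below are assumptions, not the paper's exact protocol.

```python
# Hedged sketch: build an intent-classification prompt from multiple ASR
# hypotheses instead of the single top transcript.
def build_icl_prompt(hypotheses, intents):
    lines = ["Possible transcripts of one spoken utterance (best first):"]
    lines += [f"{i + 1}. {h}" for i, h in enumerate(hypotheses)]
    lines.append(f"Choose the speaker's intent from: {', '.join(intents)}.")
    lines.append("Intent:")
    return "\n".join(lines)

prompt = build_icl_prompt(
    ["set an alarm for ate a.m.", "set an alarm for eight a.m."],
    ["alarm_set", "alarm_query", "calendar_set"],
)
print(prompt)  # send this to any instruction-following LLM
```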
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently become more promising.
This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
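A minimal sketch of confidence-gated fusion under loose assumptions: a scalar confidence estimated from the text representation decides how much weight the transcript embedding gets versus the audio embedding. The gating network and the convex combination are illustrative stand-ins for the paper's modality-confidence mechanism.

```python
# Hedged sketch: fuse audio and text embeddings weighted by an estimated
# confidence in the ASR hypothesis. Dimensions are assumptions.
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    def __init__(self, dim=768, num_intents=60):
        super().__init__()
        self.confidence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_intents)

    def forward(self, audio_emb, text_emb):
        # c near 1 -> trust the (possibly errorful) transcript embedding.
        c = self.confidence(text_emb)
        fused = c * text_emb + (1.0 - c) * audio_emb
        return self.classifier(fused)

model = ConfidenceGatedFusion()
print(model(torch.randn(2, 768), torch.randn(2, 768)).shape)  # torch.Size([2, 60])
```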
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models [69.35569554213679]
In Spoken Language Understanding (SLU), the task is to extract important information from audio commands.
This paper presents a simple method for embedding intents and entities into Finite State Transducers.
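A minimal sketch of the idea under loose assumptions: compile command templates into a tiny acceptor, here a plain trie with a wildcard arc rather than a real FST toolkit such as OpenFst, that maps a speech-to-text token sequence to an intent and captures one slot as an entity. The templates and labels are invented for illustration.

```python
# Hedged sketch: a trie standing in for a finite state transducer that maps
# token sequences to intents, with "<slot>" acting as a wildcard arc.
def compile_grammar(templates):
    trie = {}
    for tokens, intent in templates:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<final>"] = intent
    return trie

def parse(trie, tokens):
    node, entities = trie, []
    for tok in tokens:
        if tok in node:
            node = node[tok]
        elif "<slot>" in node:        # wildcard arc: capture the token as an entity
            node = node["<slot>"]
            entities.append(tok)
        else:
            return None, []           # utterance not accepted by the grammar
    return node.get("<final>"), entities

grammar = compile_grammar([
    (["turn", "on", "the", "<slot>"], "device_on"),
    (["turn", "off", "the", "<slot>"], "device_off"),
])
print(parse(grammar, "turn on the lights".split()))  # ('device_on', ['lights'])
```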
arXiv Detail & Related papers (2022-06-29T12:49:53Z)
- On Building Spoken Language Understanding Systems for Low Resourced Languages [1.2183405753834562]
We present a series of experiments to explore extremely low-resourced settings.
We perform intent classification with systems trained on as few as one data point per intent and with only one speaker in the dataset.
We find that using phonetic transcriptions to build intent classification systems in such low-resourced settings performs significantly better than using speech features.
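A minimal sketch of the one-example-per-intent setting: store one phonetic transcription per intent and classify a new utterance by Levenshtein nearest neighbour over phone sequences. The phone strings are invented for illustration; the paper's exact classifier may differ.

```python
# Hedged sketch: nearest-neighbour intent classification over phonetic
# transcriptions with one stored exemplar per intent.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

prototypes = {  # one phone-sequence exemplar per intent (invented)
    "lights_on": "t er n aa n dh ax l ay t s".split(),
    "play_music": "p l ey s ah m m y uw z ih k".split(),
}

def classify(phones):
    return min(prototypes, key=lambda k: edit_distance(phones, prototypes[k]))

print(classify("t er n aa n l ay t".split()))  # lights_on
```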
arXiv Detail & Related papers (2022-05-25T14:44:51Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet under-explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios.
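A minimal sketch of this language-agnostic pipeline: utterance embeddings from a pre-trained acoustic model feed a lightweight classifier, so no language-specific ASR is required. Random vectors stand in for real embeddings here, and the 768-dimensional size, classifier choice, and per-intent counts are assumptions.

```python
# Hedged sketch: simulate high- vs low-resource intent classification on top
# of fixed acoustic embeddings (random stand-ins for real model outputs).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_intents, dim = 6, 768

for n_per_intent in (100, 2):  # "high" vs "low" resource
    X = rng.normal(size=(num_intents * n_per_intent, dim))  # embeddings
    y = np.repeat(np.arange(num_intents), n_per_intent)     # intent labels
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(n_per_intent, "examples/intent -> train acc:", clf.score(X, y))
```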
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- End-to-End Spoken Language Understanding for Generalized Voice Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
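A minimal sketch of the joint fine-tuning objective described above: a transcription loss (CTC here, as one common choice) combined with a semantic classification loss computed on the same batch. The 0.5 weighting and the random tensors are placeholder assumptions.

```python
# Hedged sketch: joint ASR + intent loss, as would be back-propagated through
# a shared encoder. Shapes and the loss weighting are assumptions.
import torch
import torch.nn.functional as F

batch, frames, vocab, num_intents = 2, 50, 30, 10
asr_logits = torch.randn(frames, batch, vocab).log_softmax(-1)  # (T, N, V) for CTC
targets = torch.randint(1, vocab, (batch, 12))                  # token ids (0 = blank)
intent_logits = torch.randn(batch, num_intents)
intent_labels = torch.randint(0, num_intents, (batch,))

ctc = F.ctc_loss(asr_logits, targets,
                 input_lengths=torch.full((batch,), frames),
                 target_lengths=torch.full((batch,), 12))
ce = F.cross_entropy(intent_logits, intent_labels)
loss = ctc + 0.5 * ce  # joint transcription + semantic classification loss
print(float(loss))
```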
arXiv Detail & Related papers (2021-06-16T17:56:47Z)