A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding
- URL: http://arxiv.org/abs/2204.03879v1
- Date: Fri, 8 Apr 2022 07:12:11 GMT
- Title: A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding
- Authors: Nick J.C. Wang, Shaojun Wang, Jing Xiao
- Abstract summary: We compare different ways to combine ASR and NLU, in particular using a single Conformer model.
We find that it is not necessarily a choice between two-stage decoding and end-to-end systems which determines the best system for research or application.
- Score: 25.41993752756759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding (SLU) combines automatic speech
recognition (ASR) and natural language understanding (NLU) capabilities to
accomplish speech-to-intent understanding. In this paper, we compare different
ways to combine ASR and NLU,
in particular using a single Conformer model with different ways to use its
components, to better understand the strengths and weaknesses of each approach.
We find that it is not necessarily a choice between two-stage decoding and
end-to-end systems which determines the best system for research or
application. System optimization still entails carefully improving the
performance of each component. It is difficult to prove that one direction is
conclusively better than the other. In this paper, we also propose a novel
connectionist temporal summarization (CTS) method to reduce the length of
acoustic encoding sequences while improving the accuracy and processing speed
of end-to-end models. This method achieves the same intent accuracy as the best
two-stage SLU recognition with complicated and time-consuming decoding but does
so at lower computational cost. This stacked end-to-end SLU system yields an
intent accuracy of 93.97% for the SmartLights far-field set, 95.18% for the
close-field set, and 99.71% for FluentSpeech.
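The abstract names connectionist temporal summarization (CTS) as a way to shorten acoustic encoding sequences, but does not spell out the mechanics here. The sketch below is a minimal, hypothetical reading of such frame summarization, assuming access to per-frame CTC posteriors: frames confidently predicted as CTC blank are dropped, and runs of frames sharing the same argmax label are averaged into one vector. The function name `summarize_frames`, the blank threshold, and the merging rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def summarize_frames(encodings, ctc_posteriors, blank_id=0, blank_thresh=0.9):
    """Shorten an acoustic encoding sequence using CTC posteriors.

    Hypothetical CTS-style summarization: frames dominated by the CTC
    blank are dropped, and consecutive frames sharing the same argmax
    label are averaged into a single summary vector.
    """
    labels = ctc_posteriors.argmax(axis=1)
    keep = ctc_posteriors[:, blank_id] < blank_thresh  # drop confident blanks

    summary, run, run_label = [], [], None
    for frame, label, kept in zip(encodings, labels, keep):
        if not kept:
            continue
        if run and label != run_label:
            summary.append(np.mean(run, axis=0))  # flush the finished run
            run = []
        run.append(frame)
        run_label = label
    if run:
        summary.append(np.mean(run, axis=0))
    return np.array(summary)

# Toy example: 6 frames, 4-dim encodings, 3 CTC classes (class 0 = blank).
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))
post = np.array([
    [0.95, 0.03, 0.02],  # blank -> dropped
    [0.10, 0.85, 0.05],  # label 1
    [0.20, 0.75, 0.05],  # label 1 (merged with previous frame)
    [0.96, 0.02, 0.02],  # blank -> dropped
    [0.05, 0.05, 0.90],  # label 2
    [0.97, 0.02, 0.01],  # blank -> dropped
])
short = summarize_frames(enc, post)
print(short.shape)  # (2, 4): six frames summarized to two
```

A downstream intent classifier would then attend over the two summary vectors instead of all six frames, which is the source of the claimed speedup.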
Related papers
- Decoding-Time Language Model Alignment with Multiple Objectives [88.64776769490732]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
arXiv Detail & Related papers (2024-06-27T02:46:30Z)
- Bridging the Gap Between End-to-End and Two-Step Text Spotting [88.14552991115207]
Bridging Text Spotting is a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods.
We demonstrate the effectiveness of the proposed method through extensive experiments.
arXiv Detail & Related papers (2024-04-06T13:14:04Z)
- Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently.
This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale [64.10124092250126]
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
arXiv Detail & Related papers (2023-04-19T18:09:27Z)
- Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks [5.66060067322059]
We benchmark three types of systems to perform the intent detection task.
We evaluate the systems on the publicly available SLURP spoken language resource corpus.
arXiv Detail & Related papers (2022-12-16T14:01:42Z)
- Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning.
Compared to the state-of-the-art scheme, the proposed scheme has drastically lower complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
- Intelligent and Reconfigurable Architecture for KL Divergence Based Online Machine Learning Algorithm [0.0]
Online machine learning (OML) algorithms do not need any training phase and can be deployed directly in an unknown environment.
arXiv Detail & Related papers (2020-02-18T16:39:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.