Capture Salient Historical Information: A Fast and Accurate
Non-Autoregressive Model for Multi-turn Spoken Language Understanding
- URL: http://arxiv.org/abs/2206.12209v1
- Date: Fri, 24 Jun 2022 10:45:32 GMT
- Title: Capture Salient Historical Information: A Fast and Accurate
Non-Autoregressive Model for Multi-turn Spoken Language Understanding
- Authors: Lizhi Cheng, Weijia Jia, Wenmian Yang
- Abstract summary: Existing work increases inference speed by designing non-autoregressive models for single-turn Spoken Language Understanding tasks.
We propose a novel model for multi-turn SLU named Salient History Attention with Layer-Refined Transformer (SHA-LRT).
SHA captures historical information for the current dialogue from both historical utterances and results via a well-designed history-attention mechanism.
- Score: 18.988599232838766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken Language Understanding (SLU), a core component of the
task-oriented dialogue system, demands fast inference to satisfy impatient
human users. Existing work increases inference speed by designing
non-autoregressive models for single-turn SLU tasks but fails to carry over
to multi-turn SLU, where the dialogue history must be handled. The intuitive
idea is to concatenate all historical utterances and apply the
non-autoregressive models directly. However, this approach largely misses
the salient historical information and suffers from the uncoordinated-slot
problem. To overcome these shortcomings, we propose a novel model for
multi-turn SLU named Salient History Attention with Layer-Refined
Transformer (SHA-LRT), which is composed of an SHA module, a Layer-Refined
Mechanism (LRM), and a Slot Label Generation (SLG) task. SHA captures
salient historical information for the current dialogue from both historical
utterances and their predicted results via a well-designed history-attention
mechanism. LRM predicts preliminary SLU results from the Transformer's
middle states and uses them to guide the final prediction, while SLG
supplies sequential dependency information to the non-autoregressive
encoder. Experiments on public datasets indicate that our model
significantly improves multi-turn SLU performance (by 17.5% on Overall
accuracy) while accelerating inference (by nearly 15 times) over the
state-of-the-art baseline, and that it remains effective on single-turn SLU
tasks.
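To make the abstract's architecture concrete, below is a minimal,
hypothetical PyTorch sketch of the two inference-time ideas it describes: a
history-attention module that fuses salient context from encoded historical
turns into the current utterance, and a layer-refined encoder that predicts
preliminary slot labels from the Transformer's middle states and feeds them
back to guide the final non-autoregressive prediction. All class, method,
and parameter names here are illustrative assumptions, not the authors'
code, and the training-only SLG task is omitted.

```python
import torch
import torch.nn as nn

class SalientHistoryAttention(nn.Module):
    """Attend from current-utterance states to encoded dialogue history."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, current, history):
        # current: (B, T_cur, d); history: (B, T_hist, d), e.g. concatenated
        # encodings of historical utterances and their predicted SLU results.
        ctx, _ = self.attn(query=current, key=history, value=history)
        return self.norm(current + ctx)  # residually fuse salient context

class LayerRefinedEncoder(nn.Module):
    """Non-autoregressive encoder with a mid-layer preliminary prediction."""
    def __init__(self, d_model: int, n_layers: int, n_slot_labels: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lower = nn.TransformerEncoder(layer, n_layers // 2)
        self.upper = nn.TransformerEncoder(layer, n_layers - n_layers // 2)
        self.mid_head = nn.Linear(d_model, n_slot_labels)    # preliminary labels
        self.label_embed = nn.Embedding(n_slot_labels, d_model)
        self.final_head = nn.Linear(d_model, n_slot_labels)  # all slots at once

    def forward(self, x):
        mid = self.lower(x)
        prelim = self.mid_head(mid)                  # preliminary SLU result
        guide = self.label_embed(prelim.argmax(-1))  # argmax is one simple
        out = self.upper(mid + guide)                # (non-differentiable) choice
        return prelim, self.final_head(out)          # predicted in parallel
```

In use, the history attention would run first and its output would feed the
refined encoder. Because all slot labels are emitted in a single parallel
pass rather than token by token, inference avoids the sequential bottleneck
of autoregressive decoders, which is where the reported speedup comes from.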
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence [68.27280750612204]
We introduce the large auto-regressive model (LARM) for embodied agents.
LARM uses both text and multi-view images as input and predicts subsequent actions in an auto-regressive manner.
Adopting a two-phase training regimen, LARM successfully harvests enchanted equipment in Minecraft.
arXiv Detail & Related papers (2024-05-27T17:59:32Z)
- A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding [42.345266746904514]
We employ four types of pre-trained models and their combinations for spoken language understanding (SLU).
We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations.
We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora.
arXiv Detail & Related papers (2022-11-10T20:59:13Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- An Effective Non-Autoregressive Model for Spoken Language Understanding [15.99246711701726]
We propose a novel non-autoregressive Spoken Language Understanding model named Layered-Refine Transformer.
With SLG, the non-autoregressive model can efficiently obtain dependency information during training while spending no extra time at inference.
Experiments on two public datasets indicate that our model significantly improves SLU performance (by 1.5% on Overall accuracy) while substantially speeding up (by more than 10 times) the inference process.
arXiv Detail & Related papers (2021-08-16T10:26:57Z)
- A Result based Portable Framework for Spoken Language Understanding [15.99246711701726]
We propose a novel Result-based Portable Framework for Spoken Language Understanding (RPFSLU).
RPFSLU allows most existing single-turn SLU models to obtain the contextual information from multi-turn dialogues and takes full advantage of predicted results in the dialogue history during the current prediction.
Experimental results on the public KVRET dataset show that all baseline SLU models are enhanced by RPFSLU on multi-turn SLU tasks.
arXiv Detail & Related papers (2021-03-10T12:06:26Z)
- Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrates multiple scale-associated information.
Our approach selectively ignores various noise and automatically focuses on appropriate crowd scales.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.