Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language
Understanding
- URL: http://arxiv.org/abs/2112.06743v1
- Date: Mon, 13 Dec 2021 15:49:36 GMT
- Title: Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language
Understanding
- Authors: Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra,
Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj,
Ariya Rastrow, Grant P. Strimel
- Abstract summary: We propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts.
Our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively.
- Score: 14.157311972146692
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent years have seen significant advances in end-to-end (E2E) spoken
language understanding (SLU) systems, which directly predict intents and slots
from spoken audio. While dialogue history has been exploited to improve
conventional text-based natural language understanding systems, current E2E SLU
approaches have not yet incorporated such critical contextual signals in
multi-turn and task-oriented dialogues. In this work, we propose a contextual
E2E SLU model architecture that uses a multi-head attention mechanism over
encoded previous utterances and dialogue acts (actions taken by the voice
assistant) of a multi-turn dialogue. We detail alternative methods to integrate
these contexts into the state-of-the-art recurrent and transformer-based models.
When applied to a large de-identified dataset of utterances collected by a
voice assistant, our method reduces average word and semantic error rates by
10.8% and 12.6%, respectively. We also present results on a publicly available
dataset and show that our method significantly improves performance over a
non-contextual baseline.
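The abstract describes a multi-head attention mechanism that lets the current-turn encoder attend over encoded previous utterances and dialogue acts. Below is a minimal sketch of what such a contextual-carryover layer could look like in PyTorch; the module name, dimensions, and the concatenate-and-project fusion step are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch (not the authors' code): contextual carryover via multi-head
# attention over encoded previous utterances and dialogue acts.
# Dimensions, layer choices, and the fusion step are illustrative assumptions.
import torch
import torch.nn as nn

class ContextualCarryover(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Cross-attention: current-turn acoustic frames attend to context embeddings.
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse frame + attended context

    def forward(self, audio_enc, prev_utt_enc, dialog_act_enc):
        # audio_enc:      (B, T, d) encoder output for the current utterance
        # prev_utt_enc:   (B, U, d) encoded previous utterances of the dialogue
        # dialog_act_enc: (B, A, d) embedded dialogue acts taken by the assistant
        context = torch.cat([prev_utt_enc, dialog_act_enc], dim=1)  # (B, U+A, d)
        attended, _ = self.ctx_attn(query=audio_enc, key=context, value=context)
        # Carry the attended context into each frame before intent/slot decoding.
        return self.proj(torch.cat([audio_enc, attended], dim=-1))  # (B, T, d)

# Toy usage with random tensors:
if __name__ == "__main__":
    B, T, U, A, d = 2, 50, 3, 4, 256
    layer = ContextualCarryover(d_model=d, n_heads=4)
    fused = layer(torch.randn(B, T, d), torch.randn(B, U, d), torch.randn(B, A, d))
    print(fused.shape)  # torch.Size([2, 50, 256])
```

The fused representation would then feed the downstream intent and slot decoder; how the paper actually combines the attended context with the recurrent or transformer encoder states is not specified in this summary.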
Related papers
- On the Use of Audio to Improve Dialogue Policies [9.35212661749004]
We propose new architectures to add audio information by combining speech and text embeddings.
Experiments show that audio embedding-aware dialogue policies outperform text-based ones.
arXiv Detail & Related papers (2024-10-17T09:37:20Z) - SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization [48.284512017469524]
Multi-turn dialogues are characterized by their extended length and the presence of turn-taking conversations.
Traditional language models often overlook the distinct features of these dialogues by treating them as regular text.
We propose a speaker-enhanced pre-training method for long dialogue summarization.
arXiv Detail & Related papers (2024-01-31T04:50:00Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - Knowledge Augmented BERT Mutual Network in Multi-turn Spoken Dialogues [6.4144180888492075]
We propose to equip a BERT-based joint model with a knowledge attention module to mutually leverage dialogue contexts between two SLU tasks.
A gating mechanism is further utilized to filter out irrelevant knowledge triples and to circumvent distracting comprehension.
Experimental results on two complicated multi-turn dialogue datasets demonstrate the effectiveness of mutually modeling the two SLU tasks with filtered knowledge and dialogue contexts.
arXiv Detail & Related papers (2022-02-23T04:03:35Z) - A Context-Aware Hierarchical BERT Fusion Network for Multi-turn Dialog
Act Detection [6.361198391681688]
We propose CaBERT-SLU, a context-aware hierarchical BERT fusion network.
Our approach reaches new state-of-the-art (SOTA) performance on two complicated multi-turn dialogue datasets.
arXiv Detail & Related papers (2021-09-03T02:00:03Z) - Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z) - Pre-training for Spoken Language Understanding with Joint Textual and
Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.