Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited
Annotated Data
- URL: http://arxiv.org/abs/2307.01764v1
- Date: Tue, 4 Jul 2023 15:05:42 GMT
- Title: Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited
Annotated Data
- Authors: Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland
- Abstract summary: We propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, for task-oriented dialogue (ToD) systems.
KA2G achieves robust and data-efficient slot filling for speech-based ToD by 1) framing it as a text generation task, 2) grounding text generation additionally in the audio modality, and 3) conditioning on available external knowledge.
Experiments, conducted on the standard speech-based single-turn SLURP dataset and a multi-turn dataset extracted from a commercial ToD system, display strong and consistent gains.
- Score: 61.89520860387473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manually annotating fine-grained slot-value labels for task-oriented dialogue
(ToD) systems is an expensive and time-consuming endeavour. This motivates
research into slot-filling methods that operate with limited amounts of
labelled data. Moreover, the majority of current work on ToD is based solely on
text as the input modality, neglecting the additional challenges of imperfect
automatic speech recognition (ASR) when working with spoken language. In this
work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling
framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for
ToD with speech input. KA2G achieves robust and data-efficient slot filling for
speech-based ToD by 1) framing it as a text generation task, 2) grounding text
generation additionally in the audio modality, and 3) conditioning on available
external knowledge (e.g. a predefined list of possible slot values). We show
that combining both modalities within the KA2G framework improves the
robustness against ASR errors. Further, the knowledge-aware slot-value
generator in KA2G, implemented via a pointer generator mechanism, particularly
benefits few-shot and zero-shot learning. Experiments, conducted on the
standard speech-based single-turn SLURP dataset and a multi-turn dataset
extracted from a commercial ToD system, display strong and consistent gains
over prior work, especially in few-shot and zero-shot setups.
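
To make the third ingredient concrete, the following is a minimal sketch of a generic pointer-generator decoding step: a "generate" distribution over the output vocabulary is mixed, via a learned gate, with a "copy" distribution over the tokens of a predefined candidate slot-value list. It illustrates the general mechanism named in the abstract rather than the authors' KA2G implementation; all module names, tensor sizes, and the toy inputs are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of a pointer-generator decoding step
# that mixes a "generate" distribution over the vocabulary with a "copy"
# distribution over tokens of a predefined candidate slot-value list.
# All module names, sizes, and toy inputs are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGeneratorHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)  # "generate" path
        self.copy_query = nn.Linear(hidden_dim, hidden_dim)  # attention over knowledge tokens
        self.gate = nn.Linear(hidden_dim, 1)                 # p_gen: generate vs. copy

    def forward(self, dec_state, knowledge_enc, knowledge_token_ids):
        # dec_state:           (batch, hidden)     decoder state at this step
        # knowledge_enc:       (batch, K, hidden)  encoded candidate-value tokens
        # knowledge_token_ids: (batch, K)          their ids in the output vocabulary
        gen_dist = F.softmax(self.vocab_proj(dec_state), dim=-1)            # (batch, vocab)
        attn = torch.einsum("bh,bkh->bk", self.copy_query(dec_state), knowledge_enc)
        copy_attn = F.softmax(attn, dim=-1)                                 # (batch, K)
        # Scatter the copy attention back onto vocabulary ids.
        copy_dist = torch.zeros_like(gen_dist).scatter_add(1, knowledge_token_ids, copy_attn)
        p_gen = torch.sigmoid(self.gate(dec_state))                         # (batch, 1)
        return p_gen * gen_dist + (1.0 - p_gen) * copy_dist                 # next-token distribution

# Toy usage with random tensors standing in for real encoder/decoder outputs.
batch, hidden, vocab, K = 2, 16, 100, 5
head = PointerGeneratorHead(hidden, vocab)
dist = head(torch.randn(batch, hidden),
            torch.randn(batch, K, hidden),
            torch.randint(0, vocab, (batch, K)))
print(dist.shape, dist.sum(dim=-1))  # torch.Size([2, 100]), rows sum to ~1
```

A full system would additionally condition this decoder on audio-derived representations, which is where the audio grounding described in the abstract would enter.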
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs (a minimal unit-extraction sketch is given after this list).
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Text Injection for Capitalization and Turn-Taking Prediction in Speech Models [45.94388391693112]
This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model.
We show results demonstrating that our text injection method boosts capitalization performance for long-tail data.
arXiv Detail & Related papers (2023-08-14T18:28:04Z)
- Slot Induction via Pre-trained Language Model Probing and Multi-level Contrastive Learning [62.839109775887025]
The Slot Induction (SI) task aims to induce slot boundaries without explicit knowledge of token-level slot annotations.
We propose leveraging Unsupervised Pre-trained Language Model (PLM) Probing and Contrastive Learning mechanism to exploit unsupervised semantic knowledge extracted from PLM.
Our approach is shown to be effective in SI task and capable of bridging the gaps with token-level supervised models on two NLU benchmark datasets.
arXiv Detail & Related papers (2023-08-09T05:08:57Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves a significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- Zero-shot Slot Filling with DPR and RAG [10.577238010892287]
The ability to automatically extract Knowledge Graphs (KG) from a given collection of documents is a long-standing problem in Artificial Intelligence.
Recent advancements in the field try to solve this task in an end-to-end fashion using retrieval-based language models.
In this paper, we describe several strategies we adopted to improve the retriever and the generator of RAG in order to make it a better slot filler.
arXiv Detail & Related papers (2021-04-17T18:24:51Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique (a minimal subword-sampling sketch is given after this list).
Our monolingual Turkish Conformer achieved a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD [10.168591454648123]
This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
arXiv Detail & Related papers (2021-03-02T11:49:03Z)
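
As referenced in the DiscreteSLU entry above, a common way to obtain discrete speech units is to cluster frame-level features from a speech encoder with k-means and use the cluster indices as a pseudo-token sequence. The sketch below is a generic illustration of that idea, not the DiscreteSLU recipe; the random matrices stand in for a real self-supervised encoder, and the feature dimension, number of clusters, and frame counts are assumptions.

```python
# Minimal sketch of deriving discrete speech units (DSU) by k-means clustering
# of frame-level speech features. Random matrices stand in for the outputs of
# a real self-supervised speech encoder; all sizes are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_frames = rng.normal(size=(2000, 768))   # (n_frames, feature_dim) pseudo encoder outputs

# Fit a unit "codebook" on training frames.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_frames)

# Map a new utterance's frames to discrete unit ids.
utterance = rng.normal(size=(120, 768))
unit_ids = kmeans.predict(utterance)          # (120,) integers in [0, 100)

# Collapse consecutive repeats, a common post-processing step for DSU sequences.
deduped = [int(u) for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]
print(len(unit_ids), len(deduped), deduped[:10])
```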
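
The BPE-dropout technique referenced in the acoustic-unit-augmentation entry randomly skips BPE merges so that the same word receives different subword segmentations across training passes. Below is a minimal, self-contained sketch of that sampling idea; the tiny merge table is an illustrative assumption, not a real vocabulary, and this is not the paper's exact augmentation pipeline.

```python
# Minimal sketch of BPE-dropout: ordinary BPE merges are applied to a word,
# but each merge is skipped with probability p, so the same word gets
# different subword segmentations across training passes.
import random

MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]  # ordered by merge priority

def bpe_dropout(word, merges, p, rng):
    symbols = list(word)
    for left, right in merges:                      # apply merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right and rng.random() >= p:
                symbols[i:i + 2] = [left + right]   # perform the merge
            else:
                i += 1                              # dropped merge or no match: move on
        # a skipped occurrence simply stays split for this segmentation
    return symbols

rng = random.Random(0)
print(bpe_dropout("lower", MERGES, p=0.0, rng=rng))  # deterministic BPE: ['lower']
for _ in range(3):                                   # stochastic segmentations under dropout
    print(bpe_dropout("lower", MERGES, p=0.3, rng=rng))
```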