SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding
Tasks
- URL: http://arxiv.org/abs/2212.10525v2
- Date: Thu, 15 Jun 2023 22:51:09 GMT
- Title: SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding
Tasks
- Authors: Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu,
Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
- Abstract summary: Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmark for several tasks.
- Score: 88.4408774253634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding (SLU) tasks have been studied for many decades
in the speech research community, but have not received as much attention as
lower-level tasks like speech and speaker recognition. In particular, there are
not nearly as many SLU task benchmarks, and many of the existing ones use data
that is not freely available to all researchers. Recent work has begun to
introduce such benchmark datasets for several tasks. In this work, we introduce
several new annotated SLU benchmark tasks based on freely available speech
data, which complement existing benchmarks and address gaps in the SLU
evaluation landscape. We contribute four tasks: question answering and
summarization involve inference over longer speech sequences; named entity
localization addresses the speech-specific task of locating the targeted
content in the signal; dialog act classification identifies the function of a
given speech utterance. We follow the blueprint of the Spoken Language
Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the
development of SLU models that leverage the success of pre-trained speech
representations, we will be publishing for each task (i) annotations for a
relatively small fine-tuning set, (ii) annotated development and test sets, and
(iii) baseline models for easy reproducibility and comparisons. In this work,
we present the details of data collection and annotation and the performance of
the baseline models. We also perform sensitivity analysis of pipeline models'
performance (speech recognizer + text model) to the speech recognition
accuracy, using more than 20 state-of-the-art speech recognition models.
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - GRASS: Unified Generation Model for Speech-to-Semantic Tasks [7.044414457214718]
We introduce a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data.
Our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.
To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
arXiv Detail & Related papers (2023-09-06T06:44:26Z) - Improving Textless Spoken Language Understanding with Discrete Units as
Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z) - The Interpreter Understands Your Meaning: End-to-end Spoken Language
Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z) - SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation
on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE)
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.