SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation
on Natural Speech
- URL: http://arxiv.org/abs/2111.10367v1
- Date: Fri, 19 Nov 2021 18:59:23 GMT
- Title: SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation
on Natural Speech
- Authors: Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen
Livescu, Kyu J. Han
- Abstract summary: We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
- Score: 44.68649535280397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Progress in speech processing has been facilitated by shared datasets and
benchmarks. Historically these have focused on automatic speech recognition
(ASR), speaker identification, or other lower-level tasks. Interest has been
growing in higher-level spoken language understanding tasks, including using
end-to-end models, but there are fewer annotated datasets for such tasks. At
the same time, recent work shows the possibility of pre-training generic
representations and then fine-tuning for several tasks using relatively little
labeled data. We propose to create a suite of benchmark tasks for Spoken
Language Understanding Evaluation (SLUE) consisting of limited-size labeled
training sets and corresponding evaluation sets. This resource would allow the
research community to track progress, evaluate pre-trained representations for
higher-level tasks, and study open questions such as the utility of pipeline
versus end-to-end approaches. We present the first phase of the SLUE benchmark
suite, consisting of named entity recognition, sentiment analysis, and ASR on
the corresponding datasets. We focus on naturally produced (not read or
synthesized) speech, and freely available datasets. We provide new
transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli
datasets, evaluation metrics and results for baseline models, and an
open-source toolkit to reproduce the baselines and evaluate new models.
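For context on the evaluation side, the following is a minimal Python sketch of the two metrics most relevant to the SLUE ASR and NER tasks: word error rate and micro-averaged entity F1. It is not the official slue-toolkit implementation; the entity-matching rule (exact match on entity type and surface text) and the toy inputs are assumptions made purely for illustration.

    # Minimal sketch of SLUE-style metrics: word error rate (ASR) and
    # micro-averaged entity F1 (NER). Not the official slue-toolkit code;
    # the exact-match criterion for entities is an illustrative assumption.
    from collections import Counter

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + insertions + deletions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Word-level Levenshtein distance via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[-1][-1] / max(len(ref), 1)

    def entity_f1(gold, pred):
        """Micro F1 over (entity_type, entity_text) pairs, counted with multiplicity."""
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        tp = sum((gold_counts & pred_counts).values())   # multiset intersection
        precision = tp / max(sum(pred_counts.values()), 1)
        recall = tp / max(sum(gold_counts.values()), 1)
        return 2 * precision * recall / max(precision + recall, 1e-9)

    # Toy usage (made-up strings, not SLUE data):
    print(word_error_rate("the governor met the press", "the governor met press"))  # 0.2
    print(entity_f1([("PERSON", "barack obama")],
                    [("PERSON", "barack obama"), ("LOC", "chicago")]))               # ~0.67

For published numbers, the paper's open-source toolkit should be used, since details such as text normalization and the exact entity-matching criterion affect both metrics.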
Related papers
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks as there are for lower-level tasks such as ASR, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Quantifying the Task-Specific Information in Text-Based Classifications [20.148222318025528]
Shortcuts in datasets do not contribute to the task-specific information (TSI) of classification tasks.
In this paper, we consider how much task-specific information is required to classify a dataset.
This framework allows us to compare across datasets, saying that, apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pairs task.
arXiv Detail & Related papers (2021-10-17T21:54:38Z)
- Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z)
- Towards Learning a Universal Non-Semantic Representation of Speech [18.54874934311111]
This paper proposes a benchmark for comparing speech representations on non-semantic tasks, along with a representation based on an unsupervised triplet-loss objective.
The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks.
arXiv Detail & Related papers (2020-02-25T21:38:24Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.