Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on
Spoken Language Understanding
- URL: http://arxiv.org/abs/2106.15065v1
- Date: Tue, 29 Jun 2021 02:53:59 GMT
- Title: Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on
Spoken Language Understanding
- Authors: Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia,
Florian Metze, Shinji Watanabe, Alan W Black
- Abstract summary: Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
- Score: 101.24748444126982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Spoken intent prediction, for example, combines automatic speech recognition
and natural language understanding. Existing benchmarks, however, typically
hold out examples for only the surface-level sub-task. As a result, models with
similar performance on these benchmarks may have unobserved performance
differences on the other sub-tasks. To allow insightful comparisons between
competitive end-to-end architectures, we propose a framework to construct
robust test sets using coordinate ascent over sub-task specific utility
functions. Given a dataset for a decomposable task, our method optimally
creates a test set for each sub-task to individually assess sub-components of
the end-to-end model. Using spoken language understanding as a case study, we
generate new splits for the Fluent Speech Commands and Snips SmartLights
datasets. Each split has two test sets: one with held-out utterances assessing
natural language understanding abilities, and one with held-out speakers to
test speech processing skills. Our splits identify performance gaps up to 10%
between end-to-end systems that were within 1% of each other on the original
test sets. These performance gaps allow more realistic and actionable
comparisons between different architectures, driving future model development.
We release our splits and tools for the community.
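The split-construction idea in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical example of coordinate ascent over a held-out-speaker split: speakers are toggled in and out of the test set one at a time, and any change that improves a toy utility (intent coverage of the test set minus a penalty for drifting from a target test fraction) is kept. The utility, the `make_speaker_split` helper, and its parameters are illustrative assumptions, not the paper's actual utility functions or released tooling.

```python
# Illustrative sketch only: greedy coordinate ascent over a held-out-speaker
# split. The utility below (intent coverage minus a split-size penalty) is a
# hypothetical stand-in for the paper's sub-task specific utility functions.
from collections import defaultdict
import random

def make_speaker_split(examples, test_frac=0.2, iters=1000, seed=0):
    """examples: list of dicts with 'speaker' and 'intent' keys."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for ex in examples:
        by_speaker[ex["speaker"]].append(ex)
    speakers = list(by_speaker)

    all_intents = {ex["intent"] for ex in examples}
    n_total = len(examples)

    def utility(test_speakers):
        test_exs = [ex for s in test_speakers for ex in by_speaker[s]]
        if not test_exs:
            return float("-inf")
        coverage = len({ex["intent"] for ex in test_exs}) / len(all_intents)
        size_penalty = abs(len(test_exs) / n_total - test_frac)
        return coverage - size_penalty

    # Start from a random speaker subset, then flip one coordinate
    # (a speaker's membership in the test set) at a time, keeping improvements.
    test_speakers = set(rng.sample(speakers, max(1, int(test_frac * len(speakers)))))
    best = utility(test_speakers)
    for _ in range(iters):
        s = rng.choice(speakers)
        candidate = test_speakers ^ {s}  # toggle membership
        u = utility(candidate)
        if u > best:
            test_speakers, best = candidate, u

    train = [ex for s in speakers if s not in test_speakers for ex in by_speaker[s]]
    test = [ex for s in test_speakers for ex in by_speaker[s]]
    return train, test
```

A held-out-utterance split for the NLU sub-task could be built the same way by toggling unique transcriptions instead of speakers; for actual evaluation, the splits and tools released by the authors should be used.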
Related papers
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Compositional Exemplars for In-context Learning [21.961094715261133]
Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability.
We propose CEIL (Compositional Exemplars for In-context Learning) to model the interaction between the given input and in-context examples.
We validate CEIL on 12 classification and generation datasets from 7 distinct NLP tasks, including sentiment analysis, paraphrase detection, natural language inference, commonsense reasoning, open-domain question answering, code generation, and semantic parsing.
arXiv Detail & Related papers (2023-02-11T14:02:08Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Coarse-to-Fine: Hierarchical Multi-task Learning for Natural Language Understanding [51.31622274823167]
We propose a hierarchical framework with a coarse-to-fine paradigm, with the bottom level shared to all the tasks, the mid-level divided to different groups, and the top-level assigned to each of the tasks.
This allows our model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact from irrelevant tasks.
arXiv Detail & Related papers (2022-08-19T02:46:20Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE)
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)