NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language
Evaluation
- URL: http://arxiv.org/abs/2111.05196v1
- Date: Tue, 9 Nov 2021 15:09:06 GMT
- Title: NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language
Evaluation
- Authors: David Alfonso-Hermelo, Ahmad Rashid, Abbas Ghaddar, Philippe Langlais,
Mehdi Rezagholizadeh
- Abstract summary: Slot-filling and intent detection are the backbone of conversational agents such as voice assistants, and are active areas of research.
Even though state-of-the-art techniques on publicly available benchmarks show impressive performance, their ability to generalize to realistic scenarios is yet to be demonstrated.
- Abstract summary: We present NATURE, a set of simple spoken-language-oriented transformations applied to the evaluation sets of datasets to introduce human spoken-language variations while preserving the semantics of an utterance.
- Score: 7.813460653362095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Slot-filling and intent detection are the backbone of conversational agents
such as voice assistants, and are active areas of research. Even though
state-of-the-art techniques on publicly available benchmarks show impressive
performance, their ability to generalize to realistic scenarios is yet to be
demonstrated. In this work, we present NATURE, a set of simple spoken-language
oriented transformations, applied to the evaluation set of datasets, to
introduce human spoken language variations while preserving the semantics of an
utterance. We apply NATURE to common slot-filling and intent detection
benchmarks and demonstrate that simple perturbations from the standard
evaluation set by NATURE can deteriorate model performance significantly.
Through our experiments, we demonstrate that when NATURE operators are applied
to the evaluation sets of popular benchmarks, model accuracy can drop by up to
40%.
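To make the idea concrete, here is a minimal Python sketch of the kind of semantics-preserving spoken-language perturbation the abstract describes; the operator names, filler inventory, and insertion rules below are illustrative assumptions rather than the paper's actual NATURE operators.

```python
# Illustrative sketch only (not the paper's exact operator set): it mimics the
# kind of spoken-language perturbations NATURE applies to test utterances,
# e.g. inserting fillers or hesitation markers while slots and intent stay the same.
import random

FILLERS = ["um", "uh", "like", "you know"]    # assumed filler inventory
HESITATIONS = ["well,", "so,", "actually,"]   # assumed hesitation markers

def add_filler(tokens, rng):
    """Insert a filler word at a random position in the utterance."""
    i = rng.randrange(len(tokens) + 1)
    return tokens[:i] + [rng.choice(FILLERS)] + tokens[i:]

def add_hesitation(tokens, rng):
    """Prepend a hesitation marker to the utterance."""
    return [rng.choice(HESITATIONS)] + tokens

def perturb(utterance, seed=0):
    """Apply one randomly chosen spoken-language operator to an evaluation utterance."""
    rng = random.Random(seed)
    op = rng.choice([add_filler, add_hesitation])
    return " ".join(op(utterance.split(), rng))

# The gold intent and slot values are unchanged by the perturbation.
print(perturb("play some jazz in the living room"))
```

Scoring a model trained on the clean training data against such perturbed evaluation utterances is what exposes the performance drop reported above.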
Related papers
- CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment [13.74065648648307]
We propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models for speech-based cognitive impairment assessment. Our results show that conventional deep learning models degrade substantially when transferred across domains. Our findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
arXiv Detail & Related papers (2025-08-05T12:06:16Z) - TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios [47.08170350061827]
Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. Most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs). We propose a benchmark specifically designed to evaluate SLMs' effectiveness as conversational agents in realistic Chinese interactive settings.
arXiv Detail & Related papers (2025-07-24T03:23:55Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Retrieval-based Disentangled Representation Learning with Natural
Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented
Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z) - Locally Typical Sampling [84.62530743899025]
We show that today's probabilistic language generators fall short when it comes to producing coherent and fluent text. We propose a simple and efficient procedure for enforcing a local typicality criterion when generating from probabilistic models.
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Uncovering More Shallow Heuristics: Probing the Natural Language
Inference Capacities of Transformer-Based Pre-Trained Language Models Using
Syllogistic Patterns [9.031827448667086]
We explore the shallow heuristics used by transformer-based pre-trained language models (PLMs) that are fine-tuned for natural language inference (NLI).
We find evidence that the models rely heavily on certain shallow heuristics, picking up on symmetries and asymmetries between premise and hypothesis.
arXiv Detail & Related papers (2022-01-19T14:15:41Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized, so there is no need to conduct a further correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - A Systematic Investigation of Commonsense Understanding in Large
Language Models [23.430757316504316]
Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting.
We ask whether these models exhibit commonsense understanding by evaluating models against four commonsense benchmarks.
arXiv Detail & Related papers (2021-10-31T22:20:36Z) - Naturalness Evaluation of Natural Language Generation in Task-oriented
Dialogues using BERT [6.1478669848771546]
This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems.
By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results and outperforms the baselines.
arXiv Detail & Related papers (2021-09-07T08:40:14Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.