Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding
- URL: http://arxiv.org/abs/2204.06283v1
- Date: Wed, 13 Apr 2022 10:32:03 GMT
- Title: Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding
- Authors: Zeming Chen, Qiyue Gao
- Abstract summary: Curriculum is a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the age of large transformer language models, linguistic evaluation plays
an important role in diagnosing models' abilities and limitations on natural
language understanding. However, current evaluation methods have significant
shortcomings. In particular, they do not provide insight into how
well a language model captures distinct linguistic skills essential for
language understanding and reasoning. Thus they fail to effectively map out the
aspects of language understanding that remain challenging to existing models,
which makes it hard to discover potential limitations in models and datasets.
In this paper, we introduce Curriculum as a new format of NLI benchmark for
evaluation of broad-coverage linguistic phenomena. Curriculum contains a
collection of datasets that covers 36 types of major linguistic phenomena and
an evaluation procedure for diagnosing how well a language model captures
reasoning skills for distinct types of linguistic phenomena. We show that this
linguistic-phenomena-driven benchmark can serve as an effective tool for
diagnosing model behavior and verifying model learning quality. In addition,
our experiments provide insight into the limitations of existing benchmark
datasets and state-of-the-art models that may encourage future research on
re-designing datasets, model architectures, and learning objectives.
Related papers
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- More Room for Language: Investigating the Effect of Retrieval on Language Models [3.8574940917179164]
We introduce an 'ideal retrieval' methodology to study these models in a fully controllable setting.
We conduct an evaluation to examine how retrieval augmentation affects the behavior of the underlying language model.
arXiv Detail & Related papers (2024-04-16T22:43:48Z)
- Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, other factors, such as general resource availability, language family, and script type, are also important.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Feature Interactions Reveal Linguistic Structure in Language Models [2.0178765779788495]
We study feature interactions in the context of feature attribution methods for post-hoc interpretability.
We work out a grey box methodology, in which we train models to perfection on a formal language classification task.
We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model.
arXiv Detail & Related papers (2023-06-21T11:24:41Z)
- Large Linguistic Models: Analyzing theoretical linguistic abilities of LLMs [7.4815059492034335]
We show that large language models can generate coherent and valid formal analyses of linguistic data.
We focus on three subfields of formal linguistics: syntax, phonology, and semantics.
This line of inquiry exemplifies behavioral interpretability of deep learning, where models' representations are accessed by explicit prompting.
arXiv Detail & Related papers (2023-05-01T17:09:33Z)
- Probing via Prompting [71.7904179689271]
This paper introduces a novel model-free approach to probing, by formulating probing as a prompting task.
We conduct experiments on five probing tasks and show that our approach is comparable or better at extracting information than diagnostic probes.
We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
arXiv Detail & Related papers (2022-07-04T22:14:40Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT) testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.