Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding
- URL: http://arxiv.org/abs/2204.06283v1
- Date: Wed, 13 Apr 2022 10:32:03 GMT
- Title: Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding
- Authors: Zeming Chen, Qiyue Gao
- Abstract summary: Curriculum is a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the age of large transformer language models, linguistic evaluation plays
an important role in diagnosing models' abilities and limitations on natural
language understanding. However, current evaluation methods have significant
shortcomings. In particular, they do not provide insight into how
well a language model captures distinct linguistic skills essential for
language understanding and reasoning. Thus they fail to effectively map out the
aspects of language understanding that remain challenging to existing models,
which makes it hard to discover potential limitations in models and datasets.
In this paper, we introduce Curriculum as a new format of NLI benchmark for
evaluation of broad-coverage linguistic phenomena. Curriculum contains a
collection of datasets that covers 36 types of major linguistic phenomena and
an evaluation procedure for diagnosing how well a language model captures
reasoning skills for distinct types of linguistic phenomena. We show that this
linguistic-phenomena-driven benchmark can serve as an effective tool for
diagnosing model behavior and verifying model learning quality. In addition,
our experiments provide insight into the limitations of existing benchmark
datasets and state-of-the-art models that may encourage future research on
re-designing datasets, model architectures, and learning objectives.
Related papers
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- More Room for Language: Investigating the Effect of Retrieval on Language Models [3.8574940917179164]
We introduce an 'ideal retrieval' methodology to study these models in a fully controllable setting.
We conduct an evaluation to examine how retrieval augmentation affects the behavior of the underlying language model.
arXiv Detail & Related papers (2024-04-16T22:43:48Z)
- Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, other factors, such as general resource availability, language family, and script type, are also important.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Feature Interactions Reveal Linguistic Structure in Language Models [2.0178765779788495]
We study feature interactions in the context of feature attribution methods for post-hoc interpretability.
We work out a grey box methodology, in which we train models to perfection on a formal language classification task.
We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model.
arXiv Detail & Related papers (2023-06-21T11:24:41Z)
- Large Linguistic Models: Analyzing theoretical linguistic abilities of LLMs [7.4815059492034335]
We show that large language models can generate coherent and valid formal analyses of linguistic data.
We focus on three subfields of formal linguistics: syntax, phonology, and semantics.
This line of inquiry exemplifies behavioral interpretability of deep learning, where models' representations are accessed by explicit prompting.
arXiv Detail & Related papers (2023-05-01T17:09:33Z)
- Probing via Prompting [71.7904179689271]
This paper introduces a novel model-free approach to probing, by formulating probing as a prompting task.
We conduct experiments on five probing tasks and show that our approach is comparable or better at extracting information than diagnostic probes.
We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
arXiv Detail & Related papers (2022-07-04T22:14:40Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT) testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.