Language model acceptability judgements are not always robust to context
- URL: http://arxiv.org/abs/2212.08979v1
- Date: Sun, 18 Dec 2022 00:11:06 GMT
- Title: Language model acceptability judgements are not always robust to context
- Authors: Koustuv Sinha, Jon Gauthier, Aaron Mueller, Kanishka Misra, Keren
Fuentes, Roger Levy, Adina Williams
- Abstract summary: We investigate the stability of language models' performance on targeted syntactic evaluations.
We find that model judgements are generally robust when placed in randomly sampled linguistic contexts.
We show that these changes in model performance are not explainable by simple features matching the context and the test inputs.
- Score: 30.868765627701457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Targeted syntactic evaluations of language models ask whether models show
stable preferences for syntactically acceptable content over minimal-pair
unacceptable inputs. Most targeted syntactic evaluation datasets ask models to
make these judgements with just a single context-free sentence as input. This
does not match language models' training regime, in which input sentences are
always highly contextualized by the surrounding corpus. This mismatch raises an
important question: how robust are models' syntactic judgements in different
contexts? In this paper, we investigate the stability of language models'
performance on targeted syntactic evaluations as we vary properties of the
input context: the length of the context, the types of syntactic phenomena it
contains, and whether or not there are violations of grammaticality. We find
that model judgements are generally robust when placed in randomly sampled
linguistic contexts. However, they are substantially unstable for contexts
containing syntactic structures matching those in the critical test content.
Among all tested models (GPT-2 and five variants of OPT), we significantly
improve models' judgements by providing contexts with matching syntactic
structures, and conversely significantly worsen them using unacceptable
contexts with matching but violated syntactic structures. This effect is
amplified by the length of the context, except for unrelated inputs. We show
that these changes in model performance are not explainable by simple features
matching the context and the test inputs, such as lexical overlap and
dependency overlap. This sensitivity to highly specific syntactic features of
the context can only be explained by the models' implicit in-context learning
abilities.
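The evaluation loop the abstract describes is simple at its core: score both members of a minimal pair with and without a prepended context, and check whether the model still assigns higher probability to the acceptable member. Below is a minimal sketch of how such contextual minimal-pair scoring might look, assuming GPT-2 through the HuggingFace transformers API; the context and the sentence pair are illustrative stand-ins, not items from the paper's evaluation sets.

```python
# A minimal sketch (not the paper's code) of contextual minimal-pair scoring,
# assuming GPT-2 through the HuggingFace transformers API. The context and the
# sentence pair below are illustrative, not items from the paper's datasets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str, context: str = "") -> float:
    """Sum of token log-probabilities of `sentence`, conditioned on `context`."""
    ctx_ids = tokenizer.encode(context, add_special_tokens=False) if context else []
    # Prepend end-of-text so the first scored token always has a preceding position.
    prefix = [tokenizer.eos_token_id] + ctx_ids
    sent_ids = tokenizer.encode(" " + sentence, add_special_tokens=False)
    input_ids = torch.tensor([prefix + sent_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Each test-sentence token is predicted from the position immediately before it.
    return sum(log_probs[0, len(prefix) + i - 1, tok].item()
               for i, tok in enumerate(sent_ids))

acceptable = "The keys to the cabinet are on the table."
unacceptable = "The keys to the cabinet is on the table."
matching_context = "The doors to the office are locked at night."

for label, ctx in [("no context", ""), ("matching-structure context", matching_context)]:
    good = sentence_log_prob(acceptable, ctx)
    bad = sentence_log_prob(unacceptable, ctx)
    print(f"{label}: prefers acceptable = {good > bad}")
```

Under the paper's framing, an acceptable context whose syntactic structure matches the test pair (as in `matching_context` above) would be expected to sharpen the preference, while an unacceptable context with the same structure would be expected to degrade it.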
Related papers
- Investigating Idiomaticity in Word Representations [9.208145117062339]
We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese).
We present a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels.
We define a set of fine-grained metrics of Affinity and Scaled Similarity to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
arXiv Detail & Related papers (2024-11-04T21:05:01Z)
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Quantifying the Plausibility of Context Reliance in Neural Machine Translation [25.29330352252055]
We introduce Plausibility Evaluation of Context Reliance (PECoRe).
PECoRe is an end-to-end interpretability framework designed to quantify context usage in language models' generations.
We use PECoRe to quantify the plausibility of context-aware machine translation models.
arXiv Detail & Related papers (2023-10-02T13:26:43Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing syntax-semantics interactions in LMs.
This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- Does BERT really agree ? Fine-grained Analysis of Lexical Dependence on a Syntactic Task [70.29624135819884]
We study the extent to which BERT is able to perform lexically-independent subject-verb number agreement (NA) on targeted syntactic templates.
Our results on nonce sentences suggest that the model generalizes well for simple templates, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present (a minimal probing sketch in this style appears after this list).
arXiv Detail & Related papers (2022-04-14T11:33:15Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment [17.995905582226463]
We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish.
English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences.
arXiv Detail & Related papers (2020-05-01T01:21:47Z)
- Attribution Analysis of Grammatical Dependencies in LSTMs [0.043512163406551986]
LSTM language models have been shown to capture syntax-sensitive grammatical dependencies with a high degree of accuracy.
We show that LSTM performance on number agreement is directly correlated with the model's ability to distinguish subjects from other nouns.
Our results suggest that LSTM language models are able to infer robust representations of syntactic dependencies.
arXiv Detail & Related papers (2020-04-30T19:19:37Z)
- Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias [113.44471186752018]
Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy.
This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations.
arXiv Detail & Related papers (2020-01-09T18:31:55Z)
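As referenced in the "Does BERT really agree ?" entry above, that kind of lexically-independent agreement probe can be sketched with a masked-LM template. The following is a hedged illustration assuming bert-base-uncased through the HuggingFace transformers API; the templates and the attractor noun are invented for this example, not drawn from that paper's materials.

```python
# A minimal sketch of a subject-verb number agreement probe with an attractor,
# assuming bert-base-uncased via HuggingFace transformers. Templates are
# illustrative only, not items from the cited paper.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def verb_scores(template: str) -> dict:
    """Log-probabilities of singular vs. plural copula at the [MASK] position."""
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
    return {form: log_probs[tokenizer.convert_tokens_to_ids(form)].item()
            for form in ("is", "are")}

# Simple template vs. one attractor ("cabinets") intervening between the
# singular head noun and the verb.
for template in ("the key [MASK] on the table .",
                 "the key to the cabinets [MASK] on the table ."):
    s = verb_scores(template)
    winner = "is (singular)" if s["is"] > s["are"] else "are (plural)"
    print(f"{template} -> {winner}")
```

A lexically-independent probe in that paper's spirit would repeat this comparison over many templates, including nonce nouns, and check whether the singular form is still preferred once an attractor intervenes.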
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.