ReCOGS: How Incidental Details of a Logical Form Overshadow an
Evaluation of Semantic Interpretation
- URL: http://arxiv.org/abs/2303.13716v2
- Date: Tue, 23 Jan 2024 21:52:42 GMT
- Title: ReCOGS: How Incidental Details of a Logical Form Overshadow an
Evaluation of Semantic Interpretation
- Authors: Zhengxuan Wu, Christopher D. Manning, Christopher Potts
- Abstract summary: We propose a modified version of the compositional generalization benchmark COGS.
Our results reaffirm the importance of compositional generalization and careful benchmark task design.
- Score: 63.33465936588327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional generalization benchmarks for semantic parsing seek to assess
whether models can accurately compute meanings for novel sentences, but
operationalize this in terms of logical form (LF) prediction. This raises the
concern that semantically irrelevant details of the chosen LFs could shape
model performance. We argue that this concern is realized for the COGS
benchmark. COGS poses generalization splits that appear impossible for
present-day models, which could be taken as an indictment of those models.
However, we show that the negative results trace to incidental features of COGS
LFs. Converting these LFs to semantically equivalent ones and factoring out
capabilities unrelated to semantic interpretation, we find that even baseline
models get traction. A recent variable-free translation of COGS LFs suggests
similar conclusions, but we observe this format is not semantically equivalent;
it is incapable of accurately representing some COGS meanings. These findings
inform our proposal for ReCOGS, a modified version of COGS that comes closer to
assessing the target semantic capabilities while remaining very challenging.
Overall, our results reaffirm the importance of compositional generalization
and careful benchmark task design.
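To make the concern concrete, here is a minimal illustration (not from the paper, and using made-up COGS-style LFs) of how exact string match over logical forms can penalize a semantically correct prediction: two conjunctive LFs that differ only in the arbitrary indices of their bound variables express the same meaning yet fail string comparison. The sketch below checks equivalence up to a bijective renaming of variables.

```python
from itertools import permutations
import re


def parse_lf(lf: str):
    """Split a conjunctive, COGS-style logical form such as
    'dog ( x _ 1 ) AND sleep . agent ( x _ 2 , x _ 1 )'
    into a list of (predicate, argument-tuple) conjuncts."""
    conjuncts = []
    for part in lf.split(" AND "):
        m = re.match(r"(.+?)\s*\(\s*(.+?)\s*\)", part.strip())
        predicate = m.group(1).replace(" ", "")
        args = tuple(a.replace(" ", "") for a in m.group(2).split(","))
        conjuncts.append((predicate, args))
    return conjuncts


def variables(conjuncts):
    """Collect the variable symbols (x_i) used in the conjuncts."""
    return sorted({a for _, args in conjuncts for a in args if a.startswith("x_")})


def equivalent_up_to_renaming(lf1: str, lf2: str) -> bool:
    """True if the two LFs express the same set of predications under some
    bijective renaming of variables (brute force; fine for short examples)."""
    c1, c2 = parse_lf(lf1), parse_lf(lf2)
    v1, v2 = variables(c1), variables(c2)
    if len(c1) != len(c2) or len(v1) != len(v2):
        return False
    target = set(c2)
    for perm in permutations(v2):
        rename = dict(zip(v1, perm))
        renamed = {(p, tuple(rename.get(a, a) for a in args)) for p, args in c1}
        if renamed == target:
            return True
    return False


# Hypothetical gold and predicted LFs that differ only in variable indices.
gold = "dog ( x _ 1 ) AND sleep . agent ( x _ 2 , x _ 1 )"
pred = "dog ( x _ 4 ) AND sleep . agent ( x _ 7 , x _ 4 )"

print(gold == pred)                           # False: exact string match fails
print(equivalent_up_to_renaming(gold, pred))  # True: same predications, same meaning
```

Under exact match the prediction above is scored as wrong even though it makes the same predications; details of this kind, incidental to meaning, are what the paper argues can overshadow the assessment of semantic interpretation.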
Related papers
- CoGS: Model Agnostic Causality Constrained Counterfactual Explanations using goal-directed ASP [1.5749416770494706]
CoGS is a model-agnostic framework capable of generating counterfactual explanations for classification models.
CoGS offers interpretable and actionable explanations of the changes required to achieve the desired outcome.
arXiv Detail & Related papers (2024-10-30T00:43:01Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models [22.757306452760112]
We introduce QuRe, a crowd-sourced dataset of human-annotated generalized quantifiers in Wikipedia sentences featuring percentage-equipped predicates.
We explore quantifier comprehension in language models using PRESQUE, a framework that combines natural language inference and the Rational Speech Acts framework.
arXiv Detail & Related papers (2023-11-08T13:00:06Z)
- SLOG: A Structural Generalization Benchmark for Semantic Parsing [68.19511282584304]
The goal of compositional generalization benchmarks is to evaluate how well models generalize to new complex linguistic expressions.
Existing benchmarks often focus on lexical generalization, the interpretation of novel lexical items in syntactic structures familiar from training, while structural generalization cases are often underrepresented.
We introduce SLOG, a semantic parsing dataset that extends COGS with 17 structural generalization cases.
arXiv Detail & Related papers (2023-10-23T15:39:09Z)
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connections between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
- Language model acceptability judgements are not always robust to context [30.868765627701457]
We investigate the stability of language models' performance on targeted syntactic evaluations.
We find that model judgements are generally robust when placed in randomly sampled linguistic contexts, but that certain contexts can substantially shift performance.
We show that these changes in model performance are not explainable by simple features matching the context and the test inputs.
arXiv Detail & Related papers (2022-12-18T00:11:06Z)
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models given only a few examples exhibit strong prediction bias across labels.
Although few-shot fine-tuning can mitigate this prediction bias, our analysis shows that models gain their performance improvements by capturing non-task-related features.
These observations caution that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study a failure of self-consistency, in which a VAE does not consistently encode samples generated by its own decoder, examining its implications for the learned representations and the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Closed-Form Expressions for Global and Local Interpretation of Tsetlin Machines with Applications to Explaining High-Dimensional Data [7.05622249909585]
We propose closed-form expressions for understanding why a Tsetlin Machine (TM) model makes a specific prediction (local interpretability).
We also introduce expressions for measuring the importance of feature value ranges for continuous features.
For both classification and regression, our evaluation shows correspondence with SHAP as well as competitive prediction accuracy in comparison with XGBoost, Explainable Boosting Machines, and Neural Additive Models.
arXiv Detail & Related papers (2020-07-27T21:47:24Z)