oLMpics -- On what Language Model Pre-training Captures
- URL: http://arxiv.org/abs/1912.13283v2
- Date: Thu, 19 Nov 2020 08:24:06 GMT
- Title: oLMpics -- On what Language Model Pre-training Captures
- Authors: Alon Talmor, Yanai Elazar, Yoav Goldberg, Jonathan Berant
- Abstract summary: We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition.
A fundamental challenge is to understand whether the performance of an LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
- Score: 84.60594612120173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent success of pre-trained language models (LMs) has spurred widespread
interest in the language capabilities that they possess. However, efforts to
understand whether LM representations are useful for symbolic reasoning tasks
have been limited and scattered. In this work, we propose eight reasoning
tasks, which conceptually require operations such as comparison, conjunction,
and composition. A fundamental challenge is to understand whether the
performance of an LM on a task should be attributed to the pre-trained
representations or to the process of fine-tuning on the task data. To address
this, we propose an evaluation protocol that includes both zero-shot evaluation
(no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to
the learning curve of multiple controls, which paints a rich picture of the LM
capabilities. Our main findings are that: (a) different LMs exhibit
qualitatively different reasoning abilities, e.g., RoBERTa succeeds in
reasoning tasks where BERT fails completely; (b) LMs do not reason in an
abstract manner and are context-dependent, e.g., while RoBERTa can compare
ages, it can do so only when the ages are in the typical range of human ages;
(c) on half of our reasoning tasks all models fail completely. Our findings and
infrastructure can help future work on designing new datasets, models and
objective functions for pre-training.
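To make the zero-shot part of the evaluation protocol concrete, below is a minimal sketch of how a masked LM can be probed on an age-comparison question without any fine-tuning, by comparing the probabilities it assigns to candidate answer words at a masked position. It assumes the HuggingFace `transformers` library and a `bert-base-uncased` checkpoint; the prompt wording and the `zero_shot_choice` helper are illustrative and not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: any HuggingFace masked-LM checkpoint can be dropped in here.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def zero_shot_choice(masked_statement, candidates):
    """Return the candidate word the masked LM scores highest at the [MASK] slot."""
    inputs = tokenizer(masked_statement, return_tensors="pt")
    mask_positions = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        # Vocabulary scores at the (first) masked position.
        logits = model(**inputs).logits[0, mask_positions[0]]
    # Assumes each candidate is a single token in the model's vocabulary.
    candidate_ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
    best = torch.argmax(logits[candidate_ids])
    return candidates[int(best)]

# Hypothetical age-comparison probe in the spirit of the paper's example.
print(zero_shot_choice(
    "A 41 year old person is [MASK] than a 24 year old person.",
    ["older", "younger"],
))
```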
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data [3.376269351435396]
We develop a formal perspective on probing using structural causal models (SCM).
We extend a recent study of LMs in the context of a synthetic grid-world navigation task.
Our techniques provide robust empirical evidence for the ability of LMs to induce the latent concepts underlying text.
arXiv Detail & Related papers (2024-07-18T17:59:27Z)
- Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning [55.265138447400744]
Statement-Tuning is a technique that models discriminative tasks as a finite set of statements and trains an encoder model to discriminate between the potential statements to determine the label.
Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters.
The study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve strong performance with modest training data.
arXiv Detail & Related papers (2024-04-19T14:05:03Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit the so-called in-context learning (ICL) ability.
We propose a method to create LMs able to better utilize the in-context information.
We find that the data sampling used in Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
- Limitations of Language Models in Arithmetic and Symbolic Induction [20.49118435604774]
Large pretrained Language Models (LMs) can perform remarkably well on a range of Natural Language Processing (NLP) tasks.
We find that these models have limitations on certain basic symbolic manipulation tasks such as copy, reverse, and addition.
We investigate the potential causes behind this phenomenon and examine a set of possible methods, including explicit positional markers, fine-grained computation steps, and LMs with callable programs.
arXiv Detail & Related papers (2022-08-09T21:47:01Z)
- An Interpretability Evaluation Benchmark for Pre-trained Language Models [37.16893581395874]
We propose a novel evaluation benchmark that provides both English and Chinese annotated data.
It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning, and computation.
It contains perturbed instances for each original instance, so as to use the rationale consistency under perturbations as the metric for faithfulness.
arXiv Detail & Related papers (2022-07-28T08:28:09Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)