BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models
- URL: http://arxiv.org/abs/2404.04113v1
- Date: Fri, 5 Apr 2024 14:13:55 GMT
- Authors: Jacek Wiland, Max Ploner, Alan Akbik
- Abstract summary: Probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training.
Previous approaches rely on the objective function used in pre-training LMs.
We propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
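The ranking step at the core of BEAR can be sketched as follows. This is a minimal illustration, not the released framework: `log_likelihood` stands in for the model's actual statement-scoring function, which the paper obtains from either a masked or a causal LM.

```python
def pick_statement(statements, log_likelihood):
    """Return the statement the LM considers most likely.

    `log_likelihood` is any callable mapping a string to a score;
    BEAR derives it from the LM itself, so the same probe works for
    both masked and causal models.
    """
    return max(statements, key=log_likelihood)

def accuracy(instances, log_likelihood):
    """Fraction of instances where the correct alternative wins.

    Each instance is (list_of_alternative_statements, correct_index).
    """
    hits = sum(
        1
        for statements, correct in instances
        if pick_statement(statements, log_likelihood) == statements[correct]
    )
    return hits / len(instances)
```

With a toy scorer that favors the true statement, `accuracy` returns 1.0 on a single-instance dataset; in the real framework the scorer is the LM's length-normalized log-likelihood over the 7,731 (or 40,916) BEAR instances.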
Related papers
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
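The combination step described in the AIME summary can be sketched as below. The function names and report format are illustrative assumptions, not AIME's actual API; each evaluator independently judges one criterion and the reports are simply concatenated.

```python
def combine_evaluations(code, evaluators):
    """Run independent per-criterion evaluators and join their reports.

    `evaluators` is a list of (criterion_name, evaluate_fn) pairs, where
    each evaluate_fn would be a separate LLM call in the real protocol.
    The combined report is the concatenation of the individual reports.
    """
    reports = [f"[{name}] {evaluate(code)}" for name, evaluate in evaluators]
    return "\n".join(reports)
```

In the paper each evaluator is a full LLM prompted on a distinct criterion (e.g. correctness, readability); here plain callables keep the sketch self-contained.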
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models [2.1311017627417]
Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase.
We present LM-PUB-QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism.
arXiv Detail & Related papers (2024-08-28T11:44:52Z) - Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought [51.240387516059535]
We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., 1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks.
We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals.
arXiv Detail & Related papers (2024-04-04T12:46:37Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z) - IELM: An Open Information Extraction Benchmark for Pre-Trained Language
Models [75.48081086368606]
We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LM)
We create an OIE benchmark aiming to fully examine the open relational information present in the pre-trained LMs.
Surprisingly, pre-trained LMs are able to obtain competitive performance on both standard OIE datasets.
arXiv Detail & Related papers (2022-10-25T16:25:00Z) - An Interpretability Evaluation Benchmark for Pre-trained Language Models [37.16893581395874]
We propose a novel evaluation benchmark providing both English and Chinese annotated data.
It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning and computation.
It contains perturbed instances for each original instance, so as to use the rationale consistency under perturbations as the metric for faithfulness.
arXiv Detail & Related papers (2022-07-28T08:28:09Z) - Sort by Structure: Language Model Ranking as Dependency Probing [25.723591566201343]
Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored.
We propose probing to rank LMs, specifically for parsing dependencies in a given language, by measuring the degree to which labeled trees are recoverable from an LM's contextualized embeddings.
Across 46 typologically and architecturally diverse LM-language pairs, our approach predicts the best LM in 79% of cases, using orders of magnitude less compute than training a full LM.
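The selection step this summary describes can be reduced to a simple sort. The sketch below assumes each candidate LM has already been assigned a cheap probe score (e.g. how well labeled dependency trees are recovered from its frozen embeddings); it is a stand-in for the paper's ranking step, not its implementation.

```python
def rank_lms(probe_scores):
    """Rank candidate LMs by probe score, best first.

    `probe_scores` maps an LM identifier to a scalar probe score
    (higher = dependency trees are more recoverable from its
    contextualized embeddings). Computing these scores is far cheaper
    than fine-tuning every candidate LM on the downstream task.
    """
    return sorted(probe_scores, key=probe_scores.get, reverse=True)
```

The point of the approach is that the probe scores correlate with downstream parsing quality, so the top-ranked LM can be picked without training any full model.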
arXiv Detail & Related papers (2022-06-10T08:10:29Z) - oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition.
A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.