Pseudointelligence: A Unifying Framework for Language Model Evaluation
- URL: http://arxiv.org/abs/2310.12135v1
- Date: Wed, 18 Oct 2023 17:48:05 GMT
- Title: Pseudointelligence: A Unifying Framework for Language Model Evaluation
- Authors: Shikhar Murty, Orr Paradise, Pratyusha Sharma
- Abstract summary: We propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator.
We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
- Score: 14.95543156914676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With large language models surpassing human performance on an increasing
number of benchmarks, we must take a principled approach for targeted
evaluation of model capabilities. Inspired by pseudorandomness, we propose
pseudointelligence, which captures the maxim that "(perceived) intelligence
lies in the eye of the beholder". That is, that claims of intelligence are
meaningful only when their evaluator is taken into account. Concretely, we
propose a complexity-theoretic framework of model evaluation cast as a dynamic
interaction between a model and a learned evaluator. We demonstrate that this
framework can be used to reason about two case studies in language model
evaluation, as well as analyze existing evaluation methods.
Related papers
- PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation [0.0]
We introduce a novel benchmark for evaluating the role-playing capabilities of language models.
The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality.
arXiv Detail & Related papers (2024-09-10T19:00:44Z) - Estimating Knowledge in Large Language Models Without Generating a Single Token [12.913172023910203]
Current methods to evaluate knowledge in large language models (LLMs) query the model and then evaluate its generated responses.
In this work, we ask whether evaluation can be done before the model has generated any text.
Experiments with a variety of LLMs show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks.
arXiv Detail & Related papers (2024-06-18T14:45:50Z) - Can I understand what I create? Self-Knowledge Evaluation of Large Language Models [31.85129258347539]
Large language models (LLMs) have achieved remarkable progress in linguistic tasks.
Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework.
arXiv Detail & Related papers (2024-06-10T09:53:54Z) - Context versus Prior Knowledge in Language Models [49.17879668110546]
Language models often need to integrate prior knowledge learned during pretraining and new information presented in context.
We propose two mutual information-based metrics to measure a model's dependency on a context and on its prior about an entity.
arXiv Detail & Related papers (2024-04-06T13:46:53Z) - RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Scaling Language Models: Methods, Analysis & Insights from Training
Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z) - Knowledge-Grounded Dialogue Generation with Pre-trained Language Models [74.09352261943911]
We study knowledge-grounded dialogue generation with pre-trained language models.
We propose equipping response generation defined by a pre-trained language model with a knowledge selection module.
arXiv Detail & Related papers (2020-10-17T16:49:43Z) - On the model-based stochastic value gradient for continuous
reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z) - Learning to Compare for Better Training and Evaluation of Open Domain
Natural Language Generation Models [23.62054164511058]
We propose to evaluate natural language generation models by learning to compare a pair of generated sentences by fine-tuning BERT.
While able to be trained in a fully self-supervised fashion, our model can be further fine-tuned with a little amount of human preference annotation.
arXiv Detail & Related papers (2020-02-12T15:52:21Z) - Asking the Right Questions: Learning Interpretable Action Models Through
Query Answering [33.08099403894141]
This paper develops a new approach for estimating an interpretable, relational model of a black-box autonomous agent that can plan and act.
Our main contributions are a new paradigm for estimating such models using a minimal query interface with the agent, and a hierarchical querying algorithm that generates an interrogation policy for estimating the agent's internal model.
arXiv Detail & Related papers (2019-12-29T09:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.