Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- URL: http://arxiv.org/abs/2310.17567v1
- Date: Thu, 26 Oct 2023 16:55:05 GMT
- Title: Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh
Goyal, Sanjeev Arora
- Abstract summary: A key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned.
This work introduces Skill-Mix, a new evaluation that measures the ability to combine skills.
- Score: 50.11814354654953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With LLMs shifting their role from statistical modeling of language to
serving as general-purpose AI agents, how should LLM evaluations change?
Arguably, a key ability of an AI agent is to flexibly combine, as needed, the
basic skills it has learned. The capability to combine skills plays an
important role in (human) pedagogy and also in a paper on emergence phenomena
(Arora & Goyal, 2023).
This work introduces Skill-Mix, a new evaluation that measures the ability to
combine skills. Using a list of $N$ skills, the evaluator repeatedly picks
random subsets of $k$ skills and asks the LLM to produce text combining that
subset of skills. Since the number of subsets grows like $N^k$, for even modest
$k$ this evaluation will, with high probability, require the LLM to produce
text significantly different from any text in the training set. The paper
develops a methodology for (a) designing and administering such an evaluation,
and (b) automatic grading (plus spot-checking by humans) of the results using
GPT-4 as well as the open LLaMA-2 70B model.
Administering a version of Skill-Mix to popular chatbots gave results that, while
generally in line with prior expectations, contained surprises. Sizeable
differences exist among model capabilities that are not captured by their
ranking on popular LLM leaderboards ("cramming for the leaderboard").
Furthermore, simple probability calculations indicate that GPT-4's reasonable
performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior
(Bender et al., 2021), i.e., it combines skills in ways that it had not seen
during training.
We sketch how the methodology can lead to a Skill-Mix based eco-system of
open evaluations for AI capabilities of future models.
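The sampling step behind the evaluation can be sketched in a few lines. This is a minimal illustration, not the paper's released code, and the skill names below are hypothetical placeholders for the paper's actual skill list:

```python
import math
import random

def sample_skill_subsets(skills, k, num_prompts, seed=0):
    """Repeatedly draw random k-subsets of the skill list,
    as in the Skill-Mix evaluation protocol."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(skills, k))) for _ in range(num_prompts)]

# Hypothetical skill names; the paper's actual list differs.
skills = ["metaphor", "red herring", "modus ponens", "self-serving bias",
          "statistical syllogism", "folk etymology", "spatial reasoning"]

# Each sampled subset becomes one prompt asking the LLM to produce
# a short text that exhibits all k skills at once.
subsets = sample_skill_subsets(skills, k=3, num_prompts=5)

# The number of distinct k-subsets is C(N, k), which for fixed k
# grows like N^k; even a modest skill list makes memorized coverage
# of all combinations implausible.
n = 100  # e.g. a list of 100 skills
print(math.comb(n, 5))  # 75287520 distinct 5-subsets
```

The count `C(100, 5) ≈ 7.5 × 10^7` is the source of the "with high probability, significantly different from any training text" argument: the training set cannot plausibly contain text for more than a small fraction of the subsets.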
Related papers
- Agentic Skill Discovery [19.5703917813767]
Language-conditioned robotic skills make it possible to apply the high-level reasoning of Large Language Models to low-level robotic control.
A remaining challenge is to acquire a diverse set of fundamental skills.
Existing approaches either manually decompose a complex task into atomic robotic actions in a top-down fashion, or bootstrap as many combinations as possible in a bottom-up fashion to cover a wider range of task possibilities.
We show that, starting with zero skills, the ASD skill library emerges and expands to increasingly meaningful and reliable skills.
arXiv Detail & Related papers (2024-05-23T19:44:03Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- FLM-101B: An Open LLM and How to Train It with $100K Budget [64.7903965253781]
Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others.
Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations.
We demonstrate a solution that significantly reduces LLM training cost through a growth strategy.
Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities.
Experimental results show that our model, named FLM-101B, trained with a budget of 100K US dollars, achieves performance comparable to powerful and well-known models.
arXiv Detail & Related papers (2023-09-07T17:07:36Z)
- Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers [0.0]
We propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs).
We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts.
We also employ a similarity retriever to generate skill candidates which are then re-ranked using a second LLM.
arXiv Detail & Related papers (2023-07-07T12:04:12Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and find that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready to serve as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.