Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- URL: http://arxiv.org/abs/2310.17567v1
- Date: Thu, 26 Oct 2023 16:55:05 GMT
- Title: Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh
Goyal, Sanjeev Arora
- Abstract summary: A key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned.
This work introduces Skill-Mix, a new evaluation to measure the ability to combine skills.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With LLMs shifting their role from statistical modeling of language to
serving as general-purpose AI agents, how should LLM evaluations change?
Arguably, a key ability of an AI agent is to flexibly combine, as needed, the
basic skills it has learned. The capability to combine skills plays an
important role in (human) pedagogy and is also central to a theoretical
account of emergence phenomena (Arora & Goyal, 2023).
This work introduces Skill-Mix, a new evaluation to measure the ability to
combine skills. Using a list of $N$ skills, the evaluator repeatedly picks
random subsets of $k$ skills and asks the LLM to produce text combining that
subset of skills. Since the number of subsets grows like $N^k$, for even modest
$k$ this evaluation will, with high probability, require the LLM to produce
text significantly different from any text in the training set. The paper
develops a methodology for (a) designing and administering such an evaluation,
and (b) automatic grading (plus spot-checking by humans) of the results using
GPT-4 as well as the open LLaMA-2 70B model.
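As a concrete illustration of this procedure, here is a minimal Python sketch of the sample-prompt-grade loop. The skill names, prompt wording, and the `model_fn` / `grader_fn` callables are hypothetical stand-ins for LLM API wrappers, not the authors' released code, and the grading criterion is a simplification of the paper's rubric.

```python
import random

# Illustrative skill names only; the paper curates its own list of N skills.
SKILLS = ["metaphor", "red herring", "modus ponens", "self-serving bias",
          "spatial reasoning", "alliteration", "statistical syllogism"]

def sample_skill_mix_prompt(skills, k, topic, rng=random):
    """Pick a uniformly random k-subset of skills and build a prompt."""
    subset = rng.sample(skills, k)
    prompt = (f"Produce a short piece of text about {topic} that illustrates "
              f"all of the following skills: {', '.join(subset)}. "
              "Keep it under five sentences.")
    return subset, prompt

def evaluate(model_fn, grader_fn, skills, k, topic, num_trials=50):
    """Administer the evaluation: generate, then auto-grade each response.

    model_fn and grader_fn are assumed to wrap LLM calls (the paper uses
    GPT-4 and LLaMA-2 70B as graders, with human spot-checks).
    """
    scores = []
    for _ in range(num_trials):
        subset, prompt = sample_skill_mix_prompt(skills, k, topic)
        response = model_fn(prompt)
        scores.append(grader_fn(response, subset, topic))  # e.g. fraction of skills exhibited
    return sum(scores) / len(scores)
```

Because each trial draws uniformly over all $\binom{N}{k}$ subsets, repeated runs quickly reach combinations the evaluated model is unlikely to have seen verbatim in training, which is the point of the design.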
Administering a version of Skill-Mix to popular chatbots gave results that, while
generally in line with prior expectations, contained surprises. Sizeable
differences exist among model capabilities that are not captured by their
ranking on popular LLM leaderboards ("cramming for the leaderboard").
Furthermore, simple probability calculations indicate that GPT-4's reasonable
performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior
(Bender et al., 2021), i.e., it combines skills in ways that it had not seen
during training.
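The "simple probability calculations" can be made concrete with a back-of-the-envelope count; the figures below ($N=100$ skills, $k=5$) are illustrative assumptions, not the paper's exact numbers.

```python
from math import comb

N, k = 100, 5          # illustrative values: 100 skills, combined 5 at a time
subsets = comb(N, k)   # distinct k-skill subsets: 75,287,520 (~ N^k / k!)

# Even a training corpus containing a million texts that each happened to
# exhibit some particular 5-skill combination would cover at most ~1.3% of
# the possible subsets, so consistent success on random subsets points to
# composing skills rather than retrieving memorized text.
print(subsets, 1_000_000 / subsets)  # 75287520 0.01328...
```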
We sketch how the methodology can lead to a Skill-Mix-based ecosystem of
open evaluations for AI capabilities of future models.
Related papers
- Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models
Skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain.
We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales.
Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
arXiv Detail & Related papers (2024-10-17T17:51:40Z)
- Can Models Learn Skill Composition from Examples?
We evaluate the capacity of smaller models to learn compositional generalization from examples.
We show that training on combinations of $k=2$ and $k=3$ skills results in noticeable improvements in the ability to compose texts.
This study also suggests that incorporating skill-rich (potentially synthetic) text into training can substantially enhance the compositional capabilities of models.
arXiv Detail & Related papers (2024-09-29T22:14:02Z)
- Agentic Skill Discovery
Language-conditioned robotic skills make it possible to apply the high-level reasoning of Large Language Models (LLMs) to low-level robotic control.
A remaining challenge is to acquire a diverse set of fundamental skills.
We introduce a novel framework for skill discovery that is entirely driven by LLMs.
arXiv Detail & Related papers (2024-05-23T19:44:03Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers
We propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs).
We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts.
We also employ a similarity retriever to generate skill candidates, which are then re-ranked using a second LLM (a toy sketch of this retrieve-then-rerank pattern appears after this list).
arXiv Detail & Related papers (2023-07-07T12:04:12Z)
- Evaluating Language Models for Mathematics through Interactions
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
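The ESCO entry above mentions a two-stage retrieve-then-rerank pipeline; here is a deliberately toy Python sketch of that pattern. The bag-of-words similarity, the skill names, and the stubbed `rerank_with_llm` are illustrative assumptions, not that paper's implementation (which uses an LLM for the re-ranking stage).

```python
from collections import Counter
from math import sqrt

# Illustrative skill labels; the real ESCO taxonomy has thousands of entries.
ESCO_SKILLS = ["python programming", "data analysis", "project management",
               "customer service", "machine learning"]

def embed(text):
    """Toy bag-of-words vector; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(job_post, skills, top_n=3):
    """Stage 1: a similarity retriever proposes candidate skills."""
    post_vec = embed(job_post)
    return sorted(skills, key=lambda s: cosine(post_vec, embed(s)),
                  reverse=True)[:top_n]

def rerank_with_llm(job_post, candidates):
    """Stage 2: a second LLM would re-rank the candidates; stubbed here to
    keep the sketch self-contained, so it returns the retriever's order."""
    return candidates

post = "Seeking an engineer with python programming and machine learning experience"
print(rerank_with_llm(post, retrieve_candidates(post, ESCO_SKILLS)))
```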