Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- URL: http://arxiv.org/abs/2310.17567v1
- Date: Thu, 26 Oct 2023 16:55:05 GMT
- Title: Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
- Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh
Goyal, Sanjeev Arora
- Abstract summary: A key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned.
This work introduces Skill-Mix, a new evaluation that measures the ability to combine skills.
- Score: 50.11814354654953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With LLMs shifting their role from statistical modeling of language to
serving as general-purpose AI agents, how should LLM evaluations change?
Arguably, a key ability of an AI agent is to flexibly combine, as needed, the
basic skills it has learned. The capability to combine skills plays an
important role in (human) pedagogy and also in a paper on emergence phenomena
(Arora & Goyal, 2023).
This work introduces Skill-Mix, a new evaluation that measures the ability to
combine skills. Using a list of $N$ skills, the evaluator repeatedly picks
random subsets of $k$ skills and asks the LLM to produce text combining that
subset of skills. Since the number of subsets grows like $N^k$, for even modest
$k$ this evaluation will, with high probability, require the LLM to produce
text significantly different from any text in the training set. The paper
develops a methodology for (a) designing and administering such an evaluation,
and (b) automatic grading (plus spot-checking by humans) of the results using
GPT-4 as well as the open LLaMA-2 70B model.
Administering a version of Skill-Mix to popular chatbots gave results that, while
generally in line with prior expectations, contained surprises. Sizeable
differences exist among model capabilities that are not captured by their
ranking on popular LLM leaderboards ("cramming for the leaderboard").
Furthermore, simple probability calculations indicate that GPT-4's reasonable
performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior
(Bender et al., 2021), i.e., it combines skills in ways that it had not seen
during training.
We sketch how the methodology can lead to a Skill-Mix based eco-system of
open evaluations for AI capabilities of future models.
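The sampling step behind the evaluation can be sketched in a few lines. This is a minimal illustration, not the paper's released code, and the skill names below are hypothetical placeholders for the paper's actual skill list:

```python
import math
import random

def sample_skill_subsets(skills, k, num_prompts, seed=0):
    """Repeatedly draw random k-subsets of the skill list,
    as in the Skill-Mix evaluation protocol."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(skills, k))) for _ in range(num_prompts)]

# Hypothetical skill names; the paper's actual list differs.
skills = ["metaphor", "red herring", "modus ponens", "self-serving bias",
          "statistical syllogism", "folk etymology", "spatial reasoning"]

# Each sampled subset becomes one prompt asking the LLM to produce
# a short text that exhibits all k skills at once.
subsets = sample_skill_subsets(skills, k=3, num_prompts=5)

# The number of distinct k-subsets is C(N, k), which for fixed k
# grows like N^k; even a modest skill list makes memorized coverage
# of all combinations implausible.
n = 100  # e.g. a list of 100 skills
print(math.comb(n, 5))  # 75287520 distinct 5-subsets
```

The count `C(100, 5) ≈ 7.5 × 10^7` is the source of the "with high probability, significantly different from any training text" argument: the training set cannot plausibly contain text for more than a small fraction of the subsets.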
Related papers
- Agentic Skill Discovery [19.5703917813767]
Language-conditioned robotic skills make it possible to apply the high-level reasoning of Large Language Models to low-level robotic control.
A remaining challenge is to acquire a diverse set of fundamental skills.
Existing approaches either manually decompose a complex task into atomic robotic actions in a top-down fashion, or bootstrap as many combinations as possible in a bottom-up fashion to cover a wider range of task possibilities.
We show that, starting with zero skills, the ASD skill library emerges and expands to increasingly meaningful and reliable skills.
arXiv Detail & Related papers (2024-05-23T19:44:03Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- FLM-101B: An Open LLM and How to Train It with $100K Budget [64.7903965253781]
Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others.
Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations.
We demonstrate a solution that significantly reduces LLM training cost through a growth strategy.
Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities.
Experimental results show that our model, named FLM-101B, trained with a budget of 100K US dollars, achieves performance comparable to powerful and well-known models.
arXiv Detail & Related papers (2023-09-07T17:07:36Z)
- Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers [0.0]
We propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs).
We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts.
We also employ a similarity retriever to generate skill candidates which are then re-ranked using a second LLM.
arXiv Detail & Related papers (2023-07-07T12:04:12Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and find that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready to serve as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.