Related papers: Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

URL: http://arxiv.org/abs/2410.13826v2
Date: Thu, 24 Oct 2024 17:27:22 GMT
Title: Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models
Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet
Abstract summary: Skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
Score: 61.467781476005435
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate in "computing molar mass", but $19\%$ less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement over our $12$ dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
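
The slice-accuracy and routing ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the record format, and the rule of picking the model with the highest mean slice accuracy over an instance's skills are all assumptions made for illustration, and the toy records are invented placeholders rather than results from the paper.

```python
from collections import defaultdict


def skill_slice_accuracy(records):
    """Compute per-model accuracy on each skill-slice.

    `records` is a hypothetical format: an iterable of
    (model_name, skills_for_instance, was_correct) tuples.
    Returns {model_name: {skill: accuracy}}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for model, skills, correct in records:
        # An instance counts toward every skill-slice it belongs to.
        for skill in skills:
            totals[model][skill] += 1
            hits[model][skill] += int(correct)
    return {
        model: {skill: hits[model][skill] / totals[model][skill]
                for skill in totals[model]}
        for model in totals
    }


def route(skills, slice_acc, default_acc=0.0):
    """Pick the model with the highest mean slice accuracy over `skills`.

    Assumed routing rule (not taken from the paper). Skills a model was
    never evaluated on fall back to `default_acc`; ties go to the
    alphabetically first model name.
    """
    def mean_acc(model):
        accs = [slice_acc[model].get(skill, default_acc) for skill in skills]
        return sum(accs) / len(accs) if accs else default_acc

    return max(sorted(slice_acc), key=mean_acc)


# Toy, made-up evaluation records for two placeholder models.
records = [
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["applying constitutional law"], False),
    ("model_b", ["computing molar mass"], False),
    ("model_b", ["computing molar mass"], True),
    ("model_b", ["applying constitutional law"], True),
]

slice_acc = skill_slice_accuracy(records)
choice = route(["computing molar mass"], slice_acc)
```

In this toy data both placeholder models have similar aggregate accuracy, yet their slice accuracies differ sharply, which is the kind of trade-off the abstract says aggregate scores hide.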

Related papers

One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z)
Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models [3.207886496235499]
We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$.
arXiv Detail & Related papers (2025-06-16T19:03:06Z)
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text [0.0]
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels.
arXiv Detail & Related papers (2025-01-16T18:06:22Z)
Self-rationalization improves LLM as a fine-grained judge [21.917301609125417]
We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models. Self-rationalization works by having the model generate multiple judgments with rationales for the same input. We show that our model learns to produce higher quality rationales, with a win rate of $62\%$ on average compared to models just trained via SFT on rationale.
arXiv Detail & Related papers (2024-10-07T21:05:53Z)
Can Models Learn Skill Composition from Examples? [50.5142714905768]
We evaluate the capacity of smaller models to learn compositional generalization from examples. We show that training on combinations of $k=2$ and $3$ skills results in noticeable improvements in the ability to compose texts. This study also suggests that incorporating skill-rich (potentially synthetic) text into training can substantially enhance the compositional capabilities of models.
arXiv Detail & Related papers (2024-09-29T22:14:02Z)
Learning Goal-Conditioned Representations for Language Reward Models [10.94845204766088]
We propose training reward models (RMs) in a contrastive, $\textit{goal-conditioned}$ fashion. We show this way of training RM representations enables improved $\textit{steerability}$ because it allows us to evaluate the likelihood of an action achieving a particular goal-state. We additionally find that these representations can perform fine-grained control by conditioning on desired future goal-states.
arXiv Detail & Related papers (2024-07-18T20:23:11Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) reveals the ability to understand a given text passage and answer questions based on it. Existing research works in MRC rely heavily on large-size models and corpus to improve the performance evaluated by metrics such as Exact Match. We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z)
Revisiting Model Stitching to Compare Neural Representations [8.331711958610347]
We consider a "stitched model" formed by connecting the bottom-layers of $A$ to the top-layers of $B$, with a simple trainable layer between them. We show that good networks of the same architecture, but trained in very different ways, can be stitched to each other without drop in performance. We also give evidence for the intuition that "more is better" by showing that representations learnt with (1) more data, (2) bigger width, or (3) more training time can be "plugged in" to weaker models to improve performance.
arXiv Detail & Related papers (2021-06-14T18:05:10Z)
Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost. We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.