SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation
- URL: http://arxiv.org/abs/2506.00319v1
- Date: Sat, 31 May 2025 00:08:59 GMT
- Title: SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation
- Authors: Yufei Tian, Jiao Sun, Nanyun Peng, Zizhao Zhang
- Abstract summary: SkillVerse is an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. Because it reports proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behavior of modern large models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With an LLM as a judge, SkillVerse first critiques the model responses and then organizes them into a hierarchical structure termed a dendrogram. Because it reports proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behavior of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
Related papers
- EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving [54.44269624385919]
EvaLearn is a benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks. We benchmark nine frontier models and observe varied performance profiles. We observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks.
arXiv Detail & Related papers (2025-06-03T09:18:33Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z) - Rubric-based Learner Modelling via Noisy Gates Bayesian Networks for Computational Thinking Skills Assessment [40.06500618820166]
We develop a learner model for automatic skill assessment from a task-specific competence rubric.
We design a network with two layers of gates, one performing disjunctive operations by noisy-OR gates and the other conjunctive operations through logical ANDs.
The CT-cube skills assessment framework and the Cross Array Task (CAT) are used to exemplify it and demonstrate its feasibility.
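The noisy-OR gates in the layer above have a standard closed form: the gate stays off only if every active parent is independently inhibited. The sketch below computes that probability; the variable names and the leak parameter are conventional for noisy-OR models generally, not taken from this paper.

```python
def noisy_or(parents, inhibitions, leak=0.0):
    """Probability that a noisy-OR gate activates.

    parents:     list of 0/1 parent states (e.g. mastered sub-skills).
    inhibitions: per-parent probability that an active parent fails
                 to trigger the gate.
    leak:        probability the gate fires with no active parent.
    """
    p_off = 1.0 - leak
    for x, q in zip(parents, inhibitions):
        if x:
            # Each active parent independently fails with probability q.
            p_off *= q
    return 1.0 - p_off

# Two mastered sub-skills, each 20% likely to be inhibited, no leak:
p = noisy_or([1, 1], [0.2, 0.2])  # 1 - 0.2 * 0.2 = 0.96
```

A conjunctive (logical-AND) layer would instead require all parents active, which is the deterministic limit of this gate with unit inhibition on any missing parent.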
arXiv Detail & Related papers (2024-08-02T12:21:05Z) - Can I understand what I create? Self-Knowledge Evaluation of Large Language Models [31.85129258347539]
Large language models (LLMs) have achieved remarkable progress in linguistic tasks.
Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework.
arXiv Detail & Related papers (2024-06-10T09:53:54Z) - A Mathematical Theory for Learning Semantic Languages by Abstract Learners [9.139188656944429]
We develop a mathematical theory to explain the emergence of learned skills, taking the learning process into account.
We demonstrate the emergence of learned skills when the ratio of the number of training texts to the number of skills exceeds a certain threshold.
We use site percolation analysis to derive the conditions for the existence of a giant component in the skill association graph.
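The giant-component threshold the percolation analysis relies on can be seen in the simplest random-graph setting. The sketch below is a hedged illustration on an Erdős–Rényi graph G(n, p), not the paper's skill-association graph: below mean degree 1 the largest connected component stays small, above it a giant component emerges.

```python
import random

def largest_component(n, p, seed=0):
    """Size of the largest connected component of a random graph G(n, p)."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    seen = [False] * n
    best = 0
    for s in range(n):
        if seen[s]:
            continue
        # Depth-first traversal of the component containing s.
        stack, size = [s], 0
        seen[s] = True
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        best = max(best, size)
    return best

n = 2000
below = largest_component(n, 0.5 / n)  # mean degree 0.5: fragments only
above = largest_component(n, 2.0 / n)  # mean degree 2.0: giant component
```

In the paper's framing, crossing the analogous threshold in the ratio of training texts to skills is what makes broadly connected skill clusters, and hence learned skills, appear.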
arXiv Detail & Related papers (2024-04-10T13:50:46Z) - Large Language Models Are Also Good Prototypical Commonsense Reasoners [11.108562540123387]
Traditional fine-tuning approaches can be resource-intensive and potentially compromise a model's generalization capacity.
We draw inspiration from the outputs of large models on tailored tasks and semi-automatically develop a set of novel prompts. With better-designed prompts, we achieve a new state of the art (SOTA) on the ProtoQA leaderboard.
arXiv Detail & Related papers (2023-09-22T20:07:24Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) measures the ability to understand a given text passage and answer questions based on it.
Existing research works in MRC rely heavily on large models and corpora to improve the performance evaluated by metrics such as Exact Match.
We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z) - Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
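Knowledge distillation of the kind DoGo performs is typically driven by a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that generic loss; the temperature value and the T² scaling are the common convention, and nothing here is claimed to be DoGo's exact objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss:
zero = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
# A mismatched student is penalized:
loss = distill_loss([0.1, 0.2, 0.3], [2.0, 0.5, -1.0])
```

In the online (single-stage) setting, both networks train simultaneously and each serves as the other's teacher, rather than distilling from a frozen pretrained model.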