Related papers: Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs

Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs

URL: http://arxiv.org/abs/2510.11313v1
Date: Mon, 13 Oct 2025 12:03:06 GMT
Title: Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs
Authors: Le Ngoc Luyen, Marie-Hélène Abel,
Abstract summary: This paper investigates automated skill decomposition using Large Language Models (LLMs)<n>Our framework standardizes the pipeline from prompting and generation to normalization and alignment with ontology nodes.<n>To evaluate outputs, we introduce two metrics: a F1-score that uses optimal embedding-based matching to assess content accuracy, and a hierarchy-aware F1-score that credits structurally correct placements to assess granularity.
Score: 1.2891210250935148
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper investigates automated skill decomposition using Large Language Models (LLMs) and proposes a rigorous, ontology-grounded evaluation framework. Our framework standardizes the pipeline from prompting and generation to normalization and alignment with ontology nodes. To evaluate outputs, we introduce two metrics: a semantic F1-score that uses optimal embedding-based matching to assess content accuracy, and a hierarchy-aware F1-score that credits structurally correct placements to assess granularity. We conduct experiments on ROME-ESCO-DecompSkill, a curated subset of parents, comparing two prompting strategies: zero-shot and leakage-safe few-shot with exemplars. Across diverse LLMs, zero-shot offers a strong baseline, while few-shot consistently stabilizes phrasing and granularity and improves hierarchy-aware alignment. A latency analysis further shows that exemplar-guided prompts are competitive - and sometimes faster - than unguided zero-shot due to more schema-compliant completions. Together, the framework, benchmark, and metrics provide a reproducible foundation for developing ontology-faithful skill decomposition systems.

Related papers

Optimizing In-Context Demonstrations for LLM-based Automated Grading [31.353360036776976]
GUIDE (Grading Using Iteratively Designed Exemplars) is a framework that reframes exemplar selection and refinement as a boundary-focused optimization problem.<n>We show that GUIDE significantly outperforms standard retrieval baselines in experiments in physics, chemistry, and pedagogical content knowledge.
arXiv Detail & Related papers (2026-02-28T04:52:38Z)
Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays [15.895792302323883]
In educational contexts, teachers and learners require interpretable, trait-level feedback.<n>We study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms.<n>We show that explicitly modeling score ordinality substantially improves agreement with human raters.
arXiv Detail & Related papers (2026-02-04T14:30:52Z)
GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis [20.77940776708036]
Pairwise ordinal learning frameworks capture relative order by learning from comparisons.<n>They assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples.<n>We propose a Two-Stage Group-wise Ranking and Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization.<n>GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
arXiv Detail & Related papers (2026-01-14T16:26:44Z)
Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading.<n>We find that alignment is strong for binary tasks but degrades with increased rubric granularity.<n>Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
LLM-guided Hierarchical Retrieval [54.73080745446999]
LATTICE is a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity.<n>A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy.<n>Our framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark.
arXiv Detail & Related papers (2025-10-15T07:05:17Z)
CRACQ: A Multi-Dimensional Approach To Automated Document Assessment [0.0]
CRACQ is a multi-dimensional evaluation framework tailored to evaluate documents across f i v e specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality.<n>It integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis.
arXiv Detail & Related papers (2025-09-26T17:01:54Z)
TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits [7.615431299673158]
Traditional black-box optimization achieves sampling efficiency but lacks circuit understanding.<n>We propose TopoSizing, an end-to-end framework that performs robust circuit understanding directly from raw netlists.
arXiv Detail & Related papers (2025-09-17T16:52:46Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation [15.230902967865925]
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge.<n>Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization.<n>Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems.
arXiv Detail & Related papers (2025-07-23T16:14:08Z)
NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics.<n>We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method.<n> Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z)
LGAI-EMBEDDING-Preview Technical Report [41.68404082385825]
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks.<n>Our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings.<n>Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score.
arXiv Detail & Related papers (2025-06-09T05:30:35Z)
RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task.<n>Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments.<n>Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores.<n>Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.