Related papers: A Graph-Based Test-Harness for LLM Evaluation

URL: http://arxiv.org/abs/2508.20810v1
Date: Thu, 28 Aug 2025 14:10:59 GMT
Title: A Graph-Based Test-Harness for LLM Evaluation
Authors: Jessica Lundin, Guillaume Chabot-Couture
Abstract summary: We present a first known prototype of a dynamic, systematic benchmark of medical guidelines for 400+ questions. We transform the WHO IMCI handbook into a directed graph with 200+ nodes and generate questions that incorporate age-specific scenarios. We find models excel at symptom recognition but struggle with triaging severity, treatment protocols and follow-up care.
Score: 0.8164433158925593
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a first known prototype of a dynamic, systematic benchmark of medical guidelines for 400+ questions, with 3.3+ trillion possible combinations, covering 100% of guideline relationships. We transformed the WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, follow-ups, severities) and 300+ edges, then used graph traversal to generate questions that incorporated age-specific scenarios and contextual distractors to ensure clinical relevance. Our graph-based approach enables systematic evaluation across clinical tasks (45-67% accuracy), and we find models excel at symptom recognition but struggle with triaging severity, treatment protocols and follow-up care, demonstrating how customized benchmarks can identify specific capability gaps that general-domain evaluations miss. Beyond evaluation, this dynamic MCQA methodology enhances LLM post-training (supervised finetuning, GRPO, DPO), where correct answers provide high-reward samples without expensive human annotation. The graph-based approach successfully addresses the coverage limitations of manually curated benchmarks. This methodology is a step toward a scalable, contamination-resistant solution for creating comprehensive benchmarks that can be dynamically generated, including when the guidelines are updated. Code and datasets are available at https://github.com/jessicalundin/graph_testing_harness
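
Illustration (not taken from the paper's code): the abstract describes turning a guideline into a typed directed graph and traversing its edges to produce multiple-choice questions with distractors. The sketch below shows one way that idea might look. The node names, relation labels, question templates, and the `generate_questions` helper are all hypothetical placeholders, and distractors here are simply same-type nodes that are not valid answers, which may differ from the authors' method.

```python
import random
from collections import defaultdict

# Hypothetical toy graph: placeholder names only, no clinical meaning.
NODES = {
    "condition_A": "condition",
    "condition_B": "condition",
    "symptom_1": "symptom",
    "symptom_2": "symptom",
    "symptom_3": "symptom",
    "symptom_4": "symptom",
    "treatment_X": "treatment",
    "treatment_Y": "treatment",
    "treatment_Z": "treatment",
}

# Directed, labelled edges: (source, relation, target).
EDGES = [
    ("condition_A", "has_symptom", "symptom_1"),
    ("condition_A", "treated_with", "treatment_X"),
    ("condition_B", "has_symptom", "symptom_2"),
    ("condition_B", "has_symptom", "symptom_3"),
    ("condition_B", "treated_with", "treatment_Y"),
]

# Assumed question templates, one per relation type.
TEMPLATES = {
    "has_symptom": "Which symptom is associated with {source}?",
    "treated_with": "Which treatment is indicated for {source}?",
}


def generate_questions(nodes, edges, n_options=3, seed=0):
    """Emit one multiple-choice question per edge of the graph."""
    rng = random.Random(seed)

    # (source, relation) -> every target that is a correct answer.
    adjacency = defaultdict(set)
    for src, rel, dst in edges:
        adjacency[(src, rel)].add(dst)

    questions = []
    for src, rel, dst in edges:
        target_type = nodes[dst]
        # Distractors: nodes of the same type that are NOT valid answers.
        pool = sorted(
            name
            for name, kind in nodes.items()
            if kind == target_type and name not in adjacency[(src, rel)]
        )
        if len(pool) < n_options - 1:
            continue  # not enough distractors for this edge
        options = rng.sample(pool, n_options - 1) + [dst]
        rng.shuffle(options)
        questions.append(
            {
                "stem": TEMPLATES[rel].format(source=src),
                "options": options,
                "answer": dst,
            }
        )
    return questions


questions = generate_questions(NODES, EDGES)
for q in questions:
    print(q["stem"], q["options"], "->", q["answer"])
```

Because each question is derived from one edge, iterating over all edges covers every relationship in the graph, and changing the seed or the distractor pool yields a fresh variant of the same question set.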

Related papers

Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering [19.247242477915382]
Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its significance in industrial applications such as fraud detection and user segmentation, a significant chasm persists between academic research and real-world deployment. We present PyAGC, a production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties.
arXiv Detail & Related papers (2026-02-09T11:07:24Z)
Semi-supervised Instruction Tuning for Large Language Models on Text-Attributed Graphs [62.544129365882014]
We propose a novel Semi-supervised Instruction Tuning pipeline for Graph Learning, named SIT-Graph. SIT-Graph is model-agnostic and can be seamlessly integrated into any graph instruction tuning method that utilizes LLMs as the predictor. Extensive experiments demonstrate that when incorporated into state-of-the-art graph instruction tuning methods, SIT-Graph significantly enhances their performance on text-attributed graph benchmarks.
arXiv Detail & Related papers (2026-01-19T08:10:53Z)
Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA [1.9615061725959186]
Large Language Models offer transformative potential for data processing, but face limitations in clinical settings. Current solutions typically isolate retrieval methods focusing on structured data (Text2Cypher) or unstructured semantic search but fail to integrate both simultaneously. This work presents MediGRAF, a novel hybrid Graph RAG system that bridges this gap.
arXiv Detail & Related papers (2025-11-27T16:08:22Z)
GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians [32.33432636089606]
Current benchmarks for AI clinician systems fail to capture the depth, robustness, and safety required for real-world clinical practice. We introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. We develop a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end.
arXiv Detail & Related papers (2025-10-15T16:40:28Z)
HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG). The system addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z)
Unsupervised Clustering Approaches for Autism Screening: Achieving 95.31% Accuracy with a Gaussian Mixture Model [0.0]
Autism spectrum disorder (ASD) remains a challenging condition to diagnose effectively and promptly. Traditional diagnostic methods presuppose the availability of labeled data, which can be both time-consuming and resource-intensive to obtain. This paper explores the use of four distinct unsupervised clustering algorithms to analyze a publicly available dataset of 704 adult individuals screened for ASD.
arXiv Detail & Related papers (2025-02-20T18:12:59Z)
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation [9.286509119104563]
We introduce a novel graph-based Retrieval-Augmented Generation framework specifically designed for the medical domain, called MedGraphRAG. Our approach is validated on 9 medical Q&A benchmarks, 2 health fact-checking benchmarks, and one collected dataset testing long-form generation.
arXiv Detail & Related papers (2024-08-08T03:11:12Z)
Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z)
Extended Graph Assessment Metrics for Graph Neural Networks [13.49677006107642]
We introduce extended graph assessment metrics (GAMs) for regression tasks and continuous adjacency matrices. We show the correlation of these metrics with model performance on different medical population graphs and under different learning settings.
arXiv Detail & Related papers (2023-07-13T13:55:57Z)
Transductive Linear Probing: A Novel Framework for Few-Shot Node Classification [56.17097897754628]
We show that transductive linear probing with self-supervised graph contrastive pretraining can outperform the state-of-the-art fully supervised meta-learning based methods under the same protocol. We hope this work can shed new light on few-shot node classification problems and foster future research on learning from scarcely labeled instances on graphs.
arXiv Detail & Related papers (2022-12-11T21:10:34Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER). Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
VAESim: A probabilistic approach for self-supervised prototype discovery [0.23624125155742057]
We propose an architecture for image stratification based on a conditional variational autoencoder. We use a continuous latent space to represent the continuum of disorders and find clusters during training, which can then be used for image/patient stratification. We demonstrate that our method outperforms baselines in terms of kNN accuracy measured on a classification task against a standard VAE.
arXiv Detail & Related papers (2022-09-25T17:55:31Z)
Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data. Our method uses the self-expressiveness of samples to capture the global structure and an adaptive neighbor approach to respect the local structure. Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
Active Learning on Attributed Graphs via Graph Cognizant Logistic Regression and Preemptive Query Generation [37.742218733235084]
We propose a novel graph-based active learning algorithm for the task of node classification in attributed graphs. Our algorithm uses graph cognizant logistic regression, equivalent to a linearized graph convolutional neural network (GCN) for the prediction phase and maximizes the expected error reduction in the query phase. We conduct experiments on five public benchmark datasets, demonstrating a significant improvement over state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-09T18:00:53Z)
ECG-DelNet: Delineation of Ambulatory Electrocardiograms with Mixed Quality Labeling Using Neural Networks [69.25956542388653]
Deep learning (DL) algorithms are gaining weight in academic and industrial settings. We demonstrate DL can be successfully applied to low interpretative tasks by embedding ECG detection and delineation onto a segmentation framework. The model was trained using PhysioNet's QT database, comprised of 105 ambulatory ECG recordings.
arXiv Detail & Related papers (2020-05-11T16:29:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.