CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
- URL: http://arxiv.org/abs/2510.11985v1
- Date: Mon, 13 Oct 2025 22:28:51 GMT
- Title: CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
- Authors: Owen Queen, Harrison G. Zhang, James Zou
- Abstract summary: Generative language models (LMs) can facilitate the translation of fundamental research into clinically-actionable insights. CGBench is a benchmark that tests reasoning capabilities of LMs on scientific publications. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation.
- Score: 25.578430277176988
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.
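The abstract describes comparing LM explanations against human expert explanations with an LM-judge approach. A minimal sketch of such a harness is below; the prompt wording is an assumption, and `toy_judge` is a token-overlap stand-in for a real judge model, purely illustrative of the plumbing rather than the paper's actual protocol.

```python
# Sketch of an LM-as-judge comparison between a model explanation and a
# human (expert-curated) explanation. A real benchmark would call an LM;
# here `toy_judge` is a stand-in based on content-word overlap.

def build_judge_prompt(model_expl: str, human_expl: str) -> str:
    """Assemble a prompt asking a judge whether the two explanations agree."""
    return (
        "Compare the model explanation to the expert explanation.\n"
        f"Model: {model_expl}\n"
        f"Expert: {human_expl}\n"
        "Answer AGREE or DISAGREE."
    )

def toy_judge(prompt: str) -> str:
    """Stand-in judge: AGREE if the explanations share most of their words."""
    lines = [l for l in prompt.splitlines() if l.startswith(("Model:", "Expert:"))]
    model_words = set(lines[0].split()[1:])   # drop the "Model:" label
    expert_words = set(lines[1].split()[1:])  # drop the "Expert:" label
    overlap = len(model_words & expert_words) / max(len(expert_words), 1)
    return "AGREE" if overlap >= 0.5 else "DISAGREE"

def judge_agreement(pairs, judge_fn=toy_judge):
    """Fraction of (model, human) explanation pairs the judge marks AGREE."""
    verdicts = [judge_fn(build_judge_prompt(m, h)) for m, h in pairs]
    return sum(v == "AGREE" for v in verdicts) / len(verdicts)
```

Swapping `toy_judge` for a call to an actual LM is the only change a real pipeline would need, which is why the judge is passed in as a function.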
Related papers
- SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding [30.790301729371475]
Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. We introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios.
arXiv Detail & Related papers (2026-01-19T08:06:35Z)
- GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models [2.770730728142587]
This study investigates the ability of large language models to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles.
arXiv Detail & Related papers (2025-10-29T00:46:45Z)
- ExpVid: A Benchmark for Experiment Video Understanding & Reasoning [65.17173232816818]
We introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning.
arXiv Detail & Related papers (2025-10-13T16:45:28Z)
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z)
- Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
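The grounding idea described in the KG-CoI summary can be sketched as follows; the triple representation, retrieval rule, and prompt format here are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative sketch of knowledge-graph grounding for hypothesis generation:
# triples mentioning the query entity are retrieved and serialized into the
# prompt so the LM's hypothesis can draw on structured external facts.

from typing import NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

def retrieve_triples(kg: list, entity: str) -> list:
    """Return every triple whose head or tail matches the query entity."""
    return [t for t in kg if entity in (t.head, t.tail)]

def grounded_prompt(question: str, triples: list) -> str:
    """Prepend the serialized KG facts to the hypothesis-generation question."""
    facts = "\n".join(f"- {t.head} {t.relation} {t.tail}" for t in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nHypothesis:"
```

Keeping retrieval and prompt assembly separate makes it easy to swap in a real graph database without touching the prompting code.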
arXiv Detail & Related papers (2024-11-04T18:50:00Z)
- Generative causal testing to bridge data-driven models and scientific theories in language neuroscience [82.995061475971]
We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain. We show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity.
arXiv Detail & Related papers (2024-10-01T15:57:48Z)
- Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z)
- SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks [52.61917615039112]
We use CausalGym to benchmark the ability of interpretability methods to causally affect model behaviour.
We study the Pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods.
We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena.
arXiv Detail & Related papers (2024-02-19T21:35:56Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
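The extract-then-verify loop summarized above can be sketched in miniature; the vocabulary-lookup extractor and literal substring check below are toy stand-ins for the LLM calls, and the function names are assumptions for illustration.

```python
# Hedged sketch of self-verification for clinical extraction: a first pass
# extracts candidate items, then a second pass keeps an item only if a
# supporting span (its provenance) can be located in the source note.

def extract_medications(note: str, vocabulary: list) -> list:
    """First pass: naive extraction by vocabulary lookup (LLM stand-in)."""
    return [drug for drug in vocabulary if drug.lower() in note.lower()]

def verify_with_provenance(note: str, extractions: list) -> list:
    """Second pass: keep an item only if its evidence span exists in the note."""
    verified = []
    for item in extractions:
        idx = note.lower().find(item.lower())
        if idx != -1:  # provenance found: record the supporting span
            verified.append({"item": item, "evidence": note[idx:idx + len(item)]})
    return verified
```

The key point the paper's summary makes is that the verifier sees both the output and the source, so hallucinated extractions with no supporting span are filtered out before curation.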
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination [26.878606171228448]
Existing explanation datasets are mostly English-language general knowledge questions.
To address the language bias and the lack of medical resources for generating rationale-annotated QA datasets, we present ExplainCPE.
arXiv Detail & Related papers (2023-05-22T11:45:42Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.