MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics
- URL: http://arxiv.org/abs/2510.14944v1
- Date: Thu, 16 Oct 2025 17:55:14 GMT
- Title: MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics
- Authors: Yuxing Lu, Xukai Zhao, J. Ben Tamo, Micky C. Nnamdi, Rui Peng, Shuang Zeng, Xingyu Hu, Jinzhuo Wang, May D. Wang
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. We introduce MetaBench, the first benchmark for metabolomics assessment.
- Score: 23.71774159970153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.
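The cross-database identifier grounding task that the abstract singles out as difficult can be pictured with a minimal sketch. The `XREFS` table below is a hypothetical, hard-coded stand-in for the cross-reference dumps (HMDB, KEGG, PubChem) a real pipeline would load; the glucose identifiers shown are illustrative, not drawn from the benchmark itself:

```python
# Hypothetical cross-reference table; a real grounding pipeline would
# build this from HMDB/KEGG/PubChem database dumps, not hard-code it.
XREFS = {
    "glucose": {"HMDB": "HMDB0000122", "KEGG": "C00031", "PubChem": "5793"},
}

def ground(metabolite_name, target_db):
    """Map a free-text metabolite name to an identifier in target_db.

    Returns None when the name or the target database is unknown --
    exactly the long-tail failure mode the benchmark measures.
    """
    entry = XREFS.get(metabolite_name.lower())
    return entry.get(target_db) if entry else None
```

The benchmark's finding is that LLMs struggle precisely where this lookup has no entry: sparsely annotated, long-tail metabolites for which no memorized cross-reference exists.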
Related papers
- Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models [45.285970665585914]
We propose a comprehensive framework for continual learning. We employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning. We introduce a dynamic knowledge distillation framework.
arXiv Detail & Related papers (2025-12-15T08:09:40Z) - MetaMP: Seamless Metadata Enrichment and AI Application Framework for Enhanced Membrane Protein Visualization and Analysis [0.0]
We present MetaMP, a framework that unifies membrane-protein databases within a web application. In a validation focused on statistics, MetaMP resolved 77% of data discrepancies and accurately predicted the class of newly identified membrane proteins 98% of the time.
arXiv Detail & Related papers (2025-10-06T12:52:50Z) - KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z) - Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research [6.31777560888658]
We present a language-native database for boron nitride nanosheet (BNNS) polymer thermally conductive composites. The system can synthesize literature into accurate, verifiable, and expert-style guidance.
arXiv Detail & Related papers (2025-09-07T15:15:55Z) - GenOM: Ontology Matching with Description Generation and Large Language Model [19.917106654694894]
This paper introduces GenOM, a large language model (LLM)-based ontology alignment framework. Experiments conducted on the OAEI Bio-ML track demonstrate that GenOM can often achieve competitive performance.
arXiv Detail & Related papers (2025-08-14T14:48:09Z) - MetamatBench: Integrating Heterogeneous Data, Computational Tools, and Visual Interface for Metamaterial Discovery [35.74367505796871]
We introduce a unified framework, named MetamatBench, that operates on three levels. At the data level, we integrate and standardize 5 heterogeneous, multi-modal metamaterial datasets. The ML level provides a comprehensive toolkit that adapts 17 state-of-the-art ML methods for metamaterial discovery. The user level features a visual-interactive interface that bridges the gap between complex ML techniques and non-ML researchers.
arXiv Detail & Related papers (2025-05-08T19:23:59Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - Meta-RTL: Reinforcement-Based Meta-Transfer Learning for Low-Resource Commonsense Reasoning [61.8360232713375]
We propose a reinforcement-based multi-source meta-transfer learning framework (Meta-RTL) for low-resource commonsense reasoning. We present a reinforcement-based approach that dynamically estimates source task weights, which measure the contribution of the corresponding tasks to the target task in meta-transfer learning. Experimental results demonstrate that Meta-RTL substantially outperforms strong baselines and previous task selection strategies.
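The dynamic source-task weighting idea can be sketched as a simple exponentiated-gradient update: each source task's weight rises when batches drawn from it improve the target task (a positive reward signal) and falls otherwise. This is a generic illustration under assumed scalar rewards, not the paper's exact reinforcement formulation:

```python
import math

def update_task_weights(weights, rewards, lr=0.1):
    """Exponentiated-gradient update over source-task weights.

    weights: current normalized weights, one per source task.
    rewards: scalar reward per source task (e.g. reduction in
             target-task validation loss after sampling from it).
    Returns a new normalized weight vector; higher-reward sources
    gain probability mass.
    """
    logits = [math.log(w) + lr * r for w, r in zip(weights, rewards)]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Repeating this update inside the meta-training loop concentrates sampling on the source tasks most useful for the low-resource target.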
arXiv Detail & Related papers (2024-09-27T18:22:22Z) - NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
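The "embeds key data points at varying depths" setup can be sketched as follows: each needle is placed at a fractional depth of the context, with deeper needles inserted first so earlier insertions do not shift their positions. This is a minimal illustration of the general needle-in-a-haystack construction, not NeedleBench's actual implementation:

```python
def insert_needles(context, needles, depths):
    """Place each needle at a fractional depth (0.0-1.0) of the context.

    context: list of tokens/sentences forming the haystack.
    needles: facts the model must later retrieve.
    depths:  fractional positions, one per needle.
    Inserting deepest-first keeps earlier fractions accurate.
    """
    out = list(context)
    for needle, depth in sorted(
        zip(needles, depths), key=lambda p: p[1], reverse=True
    ):
        out.insert(int(len(out) * depth), needle)
    return out
```

Sweeping `depths` across the context length, and raising the density of distractor facts, is what exposes the continuous-retrieval failures reported above.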
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - DAC-MR: Data Augmentation Consistency Based Meta-Regularization for Meta-Learning [55.733193075728096]
We propose a meta-knowledge informed meta-learning (MKIML) framework to improve meta-learning.
We first integrate meta-knowledge into the meta-objective via an appropriate meta-regularization (MR) objective.
The proposed DAC-MR is expected to learn well-performing meta-models from training tasks with noisy, sparse, or unavailable meta-data.
arXiv Detail & Related papers (2023-05-13T11:01:47Z) - Alchemy: A structured task distribution for meta-reinforcement learning [52.75769317355963]
We introduce a new benchmark for meta-RL research, which combines structural richness with structural transparency.
Alchemy is a 3D video game, which involves a latent causal structure that is resampled procedurally from episode to episode.
We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents.
arXiv Detail & Related papers (2021-02-04T23:40:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.