LLM-based Evaluation Policy Extraction for Ecological Modeling
- URL: http://arxiv.org/abs/2505.13794v1
- Date: Tue, 20 May 2025 01:02:29 GMT
- Title: LLM-based Evaluation Policy Extraction for Ecological Modeling
- Authors: Qi Cheng, Licheng Liu, Qing Zhu, Runlong Yu, Zhenong Jin, Yiqun Xie, Xiaowei Jia
- Abstract summary: Evaluating ecological time series is critical for benchmarking model performance in many important applications. Traditional numerical metrics fail to capture domain-specific temporal patterns critical to ecological processes. We propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction.
- Score: 22.432508855430797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.
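As background for the abstract's baseline, the sketch below implements the two traditional metrics it names, plus a toy Bradley-Terry step in the spirit of learning an evaluation policy from pairwise annotations; `fit_metric_weights`, the logistic objective, and all parameter names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error between observed and modeled series."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def r_squared(obs, pred):
    """Coefficient of determination between observed and modeled series."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def fit_metric_weights(pairs, metrics, lr=0.1, steps=500):
    """Toy Bradley-Terry fit: learn weights over candidate metrics so the
    weighted score reproduces pairwise annotations. Each element of `pairs`
    is (obs, pred_a, pred_b) with pred_a annotated as the better prediction."""
    w = np.zeros(len(metrics))
    for _ in range(steps):
        grad = np.zeros_like(w)
        for obs, pa, pb in pairs:
            d = np.array([m(obs, pa) - m(obs, pb) for m in metrics])
            p = 1.0 / (1.0 + np.exp(-(w @ d)))  # P(annotator prefers a)
            grad += (1.0 - p) * d               # logistic log-likelihood gradient
        w += lr * grad / len(pairs)
    return w
```

With `metrics = [r_squared, lambda o, p: -rmse(o, p)]` (negated so larger is always better), the learned `w` indicates how strongly each criterion explains the annotated preferences.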
Related papers
- Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees [18.106204331704156]
We consider settings where the graph structure is covariate-dependent, and investigate a deep neural network-based approach to estimate it. Theoretical results with PAC guarantees are established for the method, under assumptions commonly used in an Empirical Risk Minimization framework. The performance of the proposed method is evaluated on several synthetic data settings and benchmarked against existing approaches.
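A minimal sketch of the covariate-dependent idea, assuming a PyTorch-style parameterization; the class name, layer sizes, and symmetric readout are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class CovariateGraph(nn.Module):
    """Toy sketch: map a covariate vector z to the strictly lower-triangular
    edge weights of a p-node graph; thresholding |theta_ij| afterwards would
    give an estimated graph structure for that covariate value."""
    def __init__(self, z_dim: int, p: int, hidden: int = 64):
        super().__init__()
        self.p = p
        n_edges = p * (p - 1) // 2
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_edges)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        idx = torch.tril_indices(self.p, self.p, offset=-1)
        theta = torch.zeros(self.p, self.p)
        theta[idx[0], idx[1]] = self.net(z)   # one weight per node pair
        return theta + theta.T                # symmetric edge-weight matrix
```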
arXiv Detail & Related papers (2025-04-23T02:13:36Z)
- Bayesian Semi-Parametric Spatial Dispersed Count Model for Precipitation Analysis [0.5399800035598186]
The method combines non-parametric techniques with an adapted dispersed count model based on renewal theory. It is applied to lung and bronchus cancer mortality data from Iowa, emphasizing environmental and demographic factors. This application highlights the significance of the methodology in public health research.
arXiv Detail & Related papers (2025-03-24T20:13:55Z)
- On the Reasoning Capacity of AI Models and How to Quantify It [0.0]
Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities. While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks. We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior.
arXiv Detail & Related papers (2025-01-23T16:58:18Z)
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning process. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
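A toy reading of how MC estimation and an LLM judge might be combined; `sampler` (returns 1 if a rollout continued from the prefix reaches the correct answer) and `llm_judge` are placeholder callables, and the agreement rule is a guess at the flavor of the mechanism, not the paper's algorithm.

```python
def mc_step_correctness(prefix, sampler, n=8):
    """Monte Carlo estimate: fraction of n rollouts continued from `prefix`
    that reach a correct final answer (sampler returns 0 or 1)."""
    return sum(sampler(prefix) for _ in range(n)) / n

def consensus_filter(prefixes, sampler, llm_judge, thresh=0.5):
    """Keep a step annotation only when the MC estimate and the LLM judge
    agree on whether the reasoning step is correct."""
    kept = []
    for prefix in prefixes:
        mc_ok = mc_step_correctness(prefix, sampler) >= thresh
        if mc_ok == bool(llm_judge(prefix)):  # consensus: both correct or both wrong
            kept.append((prefix, mc_ok))
    return kept
```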
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of how model architecture and dataset characteristics interact to shape task-specific performance.
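GenBench's real API is not reproduced here; the snippet below sketches the generic registry pattern behind "modular and expandable" benchmark suites, with all names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

@dataclass
class Task:
    name: str
    load_data: Callable[[], tuple]          # returns (inputs, labels)
    metric: Callable[[list, list], float]   # (predictions, labels) -> score

REGISTRY: Dict[str, Task] = {}

def register(task: Task) -> None:
    """Adding a task is one call; the runner below never changes."""
    REGISTRY[task.name] = task

def run_benchmark(model: Callable, tasks: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Apply `model` to every registered (or requested) task and collect scores."""
    results = {}
    for name in (list(tasks) if tasks else REGISTRY):
        task = REGISTRY[name]
        inputs, labels = task.load_data()
        results[name] = task.metric(model(inputs), labels)
    return results
```

New tasks plug in via `register(...)` without touching the runner, which is typically what such modularity claims amount to.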
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics.
Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z)
- Surrogate Model for Geological CO2 Storage and Its Use in Hierarchical MCMC History Matching [0.0]
We extend the recently introduced recurrent R-U-Net surrogate model to treat geomodel realizations drawn from a wide range of geological scenarios.
We show that, using observed data from monitoring wells in synthetic 'true' models, geological uncertainty is reduced substantially.
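To make the history-matching step concrete, here is a minimal random-walk Metropolis sketch with a flat prior and a Gaussian data misfit; `surrogate` stands in for a fast forward model such as the recurrent R-U-Net, and all names and defaults are illustrative.

```python
import numpy as np

def history_match(surrogate, d_obs, sigma, m0, n_iter=5000, step=0.1, seed=0):
    """Random-walk Metropolis over model parameters m, with a flat prior and
    Gaussian misfit between surrogate predictions and monitoring-well data."""
    rng = np.random.default_rng(seed)

    def log_like(m):
        r = (surrogate(m) - d_obs) / sigma
        return -0.5 * float(np.sum(r ** 2))

    m = np.asarray(m0, float)
    ll = log_like(m)
    chain = []
    for _ in range(n_iter):
        m_prop = m + step * rng.standard_normal(m.shape)
        ll_prop = log_like(m_prop)
        if np.log(rng.random()) < ll_prop - ll:  # Metropolis accept/reject
            m, ll = m_prop, ll_prop
        chain.append(m.copy())
    return np.array(chain)  # posterior samples of the geomodel parameters
```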
arXiv Detail & Related papers (2023-08-11T18:29:28Z)
- Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
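The three-stage structure suggests a distillation pattern like the following sketch; `llm_label` and `featurize` are placeholder callables and the ridge regressor is an arbitrary choice, not the CSEM method itself.

```python
from sklearn.linear_model import Ridge

def train_eval_model(pairs, llm_label, featurize):
    """Stage 1: pseudo-label (source, candidate) pairs with an LLM.
    Stage 2: featurize the pairs. Stage 3: distill into a cheap regressor
    that can then score new candidates without further LLM or human labels."""
    X = [featurize(src, cand) for src, cand in pairs]
    y = [llm_label(src, cand) for src, cand in pairs]
    model = Ridge(alpha=1.0).fit(X, y)
    return lambda src, cand: float(model.predict([featurize(src, cand)])[0])
```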
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
- Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations [84.42837346400151]
Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare.
Existing causal inference approaches consider regular, discrete-time intervals between observations and treatment decisions.
We propose a controllable simulation environment based on a model of tumor growth for a range of scenarios.
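For background on the controlled-differential-equation machinery, the sketch below Euler-integrates dz = f(z) dX over irregularly spaced observation times, which is what lets such models absorb measurements at arbitrary times; a real neural CDE would parameterize `f` with a network and use a proper solver, so treat this as a minimal illustration only.

```python
import numpy as np

def cde_euler(f, z0, ts, xs):
    """Euler solve of a controlled ODE dz = f(z) dX along an irregular path.
    ts: observation times (possibly unevenly spaced); xs[k]: control value at
    ts[k]; f(z): matrix of shape (dim_z, 1 + dim_x)."""
    z = np.asarray(z0, float)
    for k in range(len(ts) - 1):
        # Time is treated as an extra control channel, so irregular gaps
        # ts[k+1] - ts[k] enter the dynamics exactly like any other increment.
        dx = np.append(ts[k + 1] - ts[k], np.atleast_1d(xs[k + 1] - xs[k]))
        z = z + f(z) @ dx
    return z
```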
arXiv Detail & Related papers (2022-06-16T17:15:15Z)
- Energy-Based Learning for Cooperative Games, with Applications to Feature/Data/Model Valuations [91.36803653600667]
We present a novel energy-based treatment for cooperative games, with a theoretical justification by the maximum entropy framework.
Surprisingly, by conducting variational inference of the energy-based model, we recover various game-theoretic valuation criteria, such as Shapley value and Banzhaf index.
We experimentally demonstrate that the proposed Variational Index enjoys intriguing properties on certain synthetic and real-world valuation problems.
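For reference, here is the classical permutation-sampling estimator of the Shapley value that this line of work recovers as a special case; this is the textbook estimator, not the proposed Variational Index.

```python
import numpy as np

def shapley_mc(value_fn, n_players, n_perms=2000, seed=0):
    """Monte Carlo Shapley values: average each player's marginal contribution
    to value_fn over random orderings; value_fn maps a frozenset of player
    indices to a real-valued payoff."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_players)
    for _ in range(n_perms):
        coalition, v_prev = frozenset(), value_fn(frozenset())
        for i in map(int, rng.permutation(n_players)):
            coalition = coalition | {i}
            v_new = value_fn(coalition)
            phi[i] += v_new - v_prev   # marginal contribution of player i
            v_prev = v_new
    return phi / n_perms
```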
arXiv Detail & Related papers (2021-06-05T17:39:04Z)