FormationEval, an open multiple-choice benchmark for petroleum geoscience
- URL: http://arxiv.org/abs/2601.02158v1
- Date: Mon, 05 Jan 2026 14:36:02 GMT
- Title: FormationEval, an open multiple-choice benchmark for petroleum geoscience
- Authors: Almaz Ermilov
- Abstract summary: FormationEval is an open multiple-choice question benchmark for evaluating language models on petroleum geoscience disciplines. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.
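The abstract describes two concrete measurements: per-domain multiple-choice accuracy and a residual length bias (correct answers tend to be longer than distractors). The sketch below illustrates both under an assumed JSON layout; the field names ("question", "options", "answer", "domain") and the predict callback are hypothetical stand-ins, not FormationEval's documented format or released evaluation code.

```python
# Minimal sketch of an MCQ evaluation and length-bias check, assuming a
# hypothetical JSON schema: [{"question": str, "options": [str, ...],
# "answer": int, "domain": str}, ...]. Not FormationEval's actual format.
import json
from collections import defaultdict
from statistics import mean

def load_questions(path):
    """Load benchmark items from a JSON array (schema assumed above)."""
    with open(path) as f:
        return json.load(f)

def accuracy_by_domain(items, predict):
    """Score a model via a caller-supplied predict(question, options) -> int
    callback returning the index of the chosen option."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        choice = predict(item["question"], item["options"])
        total[item["domain"]] += 1
        if choice == item["answer"]:
            correct[item["domain"]] += 1
    return {d: correct[d] / total[d] for d in total}

def length_bias(items):
    """Mean character length of correct options minus that of distractors;
    a positive value indicates the residual bias the abstract documents."""
    correct_lens = [len(it["options"][it["answer"]]) for it in items]
    distractor_lens = [len(opt) for it in items
                       for i, opt in enumerate(it["options"])
                       if i != it["answer"]]
    return mean(correct_lens) - mean(distractor_lens)
```

For example, calling accuracy_by_domain(items, my_predict) with a model-backed predict function would surface per-domain gaps of the kind the paper reports (petrophysics scoring lowest), while a positive length_bias(items) would reproduce the documented tendency for correct answers to be longer.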
Related papers
- GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI [52.13138825802668]
GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
arXiv Detail & Related papers (2025-11-19T17:45:02Z) - Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation [32.22754624992446]
We present GeoAnalystBench, a benchmark of 50 Python-based tasks derived from real-world geospatial problems. Using this benchmark, we assess both proprietary and open source models. Results reveal a clear gap: proprietary models such as ChatGPT-4o-mini achieve high validity (95%) and stronger code alignment.
arXiv Detail & Related papers (2025-09-07T00:51:57Z) - Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models [13.622744836632231]
In August 2025, OpenAI released GPT-OSS models, its first open-weight large language models since GPT-2 in 2019. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks.
arXiv Detail & Related papers (2025-08-17T18:25:37Z) - Approximating Language Model Training Data from Weights [70.08614275061689]
We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents.
arXiv Detail & Related papers (2025-06-18T15:26:43Z) - MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models [7.422346909538787]
MapEval is a benchmark designed to assess foundation models across three distinct tasks. It covers spatial relationships, navigation, travel planning, and real-world map interactions. It requires models to handle long-context reasoning, API interactions, and visual map analysis.
arXiv Detail & Related papers (2024-12-31T07:20:32Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of conventional reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.