Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
- URL: http://arxiv.org/abs/2601.07606v1
- Date: Mon, 12 Jan 2026 14:55:37 GMT
- Title: Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
- Authors: Bingyang Ye, Shan Chen, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman,
- Abstract summary: PoT is a semi-verifiable benchmarking framework for large language models. It links scientific idea judgments to downstream signals that become observable later. PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
- Score: 12.46263502996029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
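The abstract describes a time-partitioned, future-verifiable evaluation protocol: freeze a pre-cutoff evidence snapshot, ask the model to forecast post-cutoff outcomes, and score the forecasts once ground truth arrives. The following is a minimal sketch of such a loop; it is not the authors' code, and the data fields, constant baseline, and mean-absolute-error scoring rule are illustrative assumptions rather than details taken from PoT.

```python
# Minimal sketch of a time-partitioned, future-verifiable evaluation loop
# in the spirit of PoT. All names, fields, and the scoring rule are
# illustrative assumptions, not details from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    idea_id: str
    pre_cutoff_evidence: str    # evidence snapshot frozen before the cutoff date
    post_cutoff_outcome: float  # signal observed later, e.g. a citation-based outcome


def evaluate(instances: List[Instance],
             forecast: Callable[[str], float]) -> float:
    """Score forecasts against outcomes that only became observable after the
    cutoff; mean absolute error is used here purely as an example metric."""
    errors = []
    for inst in instances:
        # The model sees only the frozen pre-cutoff snapshot (offline sandbox).
        prediction = forecast(inst.pre_cutoff_evidence)
        errors.append(abs(prediction - inst.post_cutoff_outcome))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy data and a trivial constant baseline, purely for illustration.
    data = [
        Instance("idea-1", "pre-cutoff abstract and early evidence ...", 42.0),
        Instance("idea-2", "pre-cutoff abstract and early evidence ...", 7.0),
    ]
    constant_baseline = lambda evidence: 20.0
    print(f"MAE of constant baseline: {evaluate(data, constant_baseline):.2f}")
```

The key design point the sketch tries to capture is that ground truth is unavailable at forecast time, so the benchmark becomes verifiable only after the post-cutoff signals are collected.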
Related papers
- DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z) - When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z) - ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control [44.113610704492224]
We introduce ReThinker, a confidence-aware agentic framework that orchestrates retrieval, tool use, and multi-agent reasoning. Experiments on HLE, GAIA, and XBench demonstrate that ReThinker consistently outperforms state-of-the-art foundation models.
arXiv Detail & Related papers (2026-02-04T12:41:52Z) - FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - Current Agents Fail to Leverage World Model as Tool for Foresight [61.82522354207919]
Generative world models offer a promising remedy: agents could use them to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition.
arXiv Detail & Related papers (2026-01-07T13:15:23Z) - Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions [35.48753431700434]
We study text-only performance forecasting, estimating a model's score from a redacted task description and intended configuration. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds.
arXiv Detail & Related papers (2025-09-25T01:02:27Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
SCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components: TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; and a PC^2-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy than the classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
"Score-based assessment" framework for estimating transferability of pre-trained speech models.
We leverage upon two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)