Related papers: On the Rigour of Scientific Writing: Criteria, Analysis, and Insights

DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z)
The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z)
Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z)
DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey [53.85391477976017]
DeepSurvey-Bench is a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. We construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys.
arXiv Detail & Related papers (2026-01-13T14:42:56Z)
SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z)
Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG [10.620784202716404]
We argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators. We present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG.
arXiv Detail & Related papers (2025-10-01T12:55:34Z)
SCI-Verifier: Scientific Verifier with Thinking [37.08904000514563]
Large language models (LLMs) are increasingly applied to scientific reasoning. Existing verification studies in scientific domains suffer from two major limitations. We propose solutions at both the data and model levels.
arXiv Detail & Related papers (2025-09-29T04:58:43Z)
Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models [1.0138329337410974]
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy.
arXiv Detail & Related papers (2025-08-05T19:20:05Z)
Doubly robust estimation of causal effects for random object outcomes with continuous treatments [8.874402662101234]
Causal inference is central to statistics and scientific discovery. Modern applications increasingly involve complex, non-Euclidean data structures. This paper introduces a novel framework for causal inference with continuous treatments applied to non-Euclidean data.
arXiv Detail & Related papers (2025-06-28T04:55:12Z)
Model Reprogramming Demystified: A Neural Tangent Kernel Perspective [49.42322600160337]
We present a comprehensive theoretical analysis of Model Reprogramming (MR) through the lens of the Neural Tangent Kernel (NTK) framework. We demonstrate that the success of MR is governed by the eigenvalue spectrum of the NTK matrix on the target dataset. Our contributions include a novel theoretical framework for MR, insights into the relationship between source and target models, and extensive experiments validating our findings.
arXiv Detail & Related papers (2025-05-31T16:15:04Z)
Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems [44.61435605872856]
We introduce a practical robustness definition grounded in distributional robustness, explicitly tailored to industrial CPS. Our framework simulates realistic disturbances, such as sensor drift, noise and irregular sampling, enabling thorough robustness analyses of forecasting models.
arXiv Detail & Related papers (2025-04-04T14:50:48Z)
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
The BRAVO Semantic Segmentation Challenge Results in UNCV2024 [68.20197719071436]
We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
arXiv Detail & Related papers (2024-09-23T15:17:30Z)
The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
arXiv Detail & Related papers (2024-07-16T11:12:28Z)
Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale [2.50194939587674]
This dissertation addresses quantifying and mitigating sources of arbitrariness in ML, and randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability. The dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately bound up with research in law and policy.
arXiv Detail & Related papers (2024-06-13T19:29:37Z)
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric [5.592917884093537]
The potential of quality estimation (QE) metrics is introduced and evaluated as a novel tool to enhance explainable artificial intelligence (XAI) in ASR systems. The capabilities of the NoRefER metric are explored in identifying word-level errors to aid post-editors in refining ASR hypotheses.
arXiv Detail & Related papers (2024-01-20T16:48:55Z)
A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models [0.0]
The study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature. The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy. The framework consistently delivers accurate domain-specific responses with minimal human oversight.
arXiv Detail & Related papers (2023-12-31T17:15:25Z)
Empirical evaluation of Uncertainty Quantification in Retrieval-Augmented Language Models for Science [0.0]
This study investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data. We observe that an existing RALM finetuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions. We also found that RALMs are overconfident in their predictions, making inaccurate predictions more confidently than accurate ones.
arXiv Detail & Related papers (2023-11-15T20:42:11Z)
The Unreasonable Effectiveness of Deep Evidential Regression [72.30888739450343]
A new approach with uncertainty-aware regression-based neural networks (NNs) shows promise over traditional deterministic methods and typical Bayesian NNs. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification.
arXiv Detail & Related papers (2022-05-20T10:10:32Z)
Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation. The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
Learning Topic Models: Identifiability and Finite-Sample Analysis [6.181048261489101]
We propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood. We conclude with empirical studies on both simulated and real datasets.
arXiv Detail & Related papers (2021-10-08T16:35:42Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.