Related papers: SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

URL: http://arxiv.org/abs/2602.22971v1
Date: Thu, 26 Feb 2026 13:08:56 GMT
Title: SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
Authors: Peiyao Xiao, Xiaogang Li, Chengliang Xu, Jiayi Wang, Ben Wang, Zichao Chen, Zeyu Wang, Kejun Yu, Yueqian Chen, Xulin Liu, Wende Xiao, Bing Zhao, Hu Wei,
Abstract summary: We present an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Our pipeline achieves extreme token savings while maintaining high dataset purity.
Score: 12.070587084660096
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates "llbox" for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.

Related papers

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models [56.94569090844015]
TokaMark is a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy.
arXiv Detail & Related papers (2026-02-05T16:49:44Z)
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency [60.83660377169452]
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges.
arXiv Detail & Related papers (2025-11-29T09:52:34Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples of 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning [16.625057220045292]
We present Joint-stochastic-approximation (JSA) autoencoders, a new family of algorithms for building deep directed generative models. The JSA learning algorithm directly maximizes the data log-likelihood and simultaneously minimizes the inclusive KL divergence between the posterior and the inference model. We empirically show that JSA autoencoders with discrete latent space achieve comparable performance to other state-of-the-art DGMs with continuous latent space in semi-supervised tasks.
arXiv Detail & Related papers (2025-05-24T06:52:23Z)
LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation [11.045086599038338]
We introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search.
arXiv Detail & Related papers (2025-05-17T14:47:36Z)
Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM [6.26004554105527]
We propose a novel approach to achieve the same results from unannotated full documents using general large language models (LLMs) with lower hardware and labor costs. Our approach combines two major stages: named entity recognition (NER) and relation extraction (RE). To enhance the effectiveness of the prompt, we propose a five-part template structure and scenario-based prompt design principles.
arXiv Detail & Related papers (2025-05-02T07:33:20Z)
Synthetic Data Generation with LLM for Improved Depression Prediction [5.508617844957542]
We propose a pipeline for Large Language Models to generate synthetic data to improve the performance of depression prediction models. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset.
arXiv Detail & Related papers (2024-11-26T18:31:14Z)
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios. We introduce MATRIX-Gen for controllable and highly realistic data synthesis. On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z)
Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage. Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space. Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.