Fugu-MT Paper Translation (Abstract): TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

Paper overview: TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

arxiv url: http://arxiv.org/abs/2602.02523v1
Date: Sun, 25 Jan 2026 23:44:03 GMT
Status: Translation complete
Last updated in system: 2026-02-04 18:37:14.898295
Title: TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis
Title (reference translation): TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis
Authors: Zerui Cheng, Jiashuo Liu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, Wenhao Huang
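Abstract summary: We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. On standard regression metrics, TabPFN v2.5 achieves R^2=0.998 in-distribution and maintains a positive R^2 even under distribution shift. When rounded consistency (exact integer match) is measured, a different picture emerges: TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%.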
Reference score (independently computed attention score): 22.883505574924303
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard tabular benchmarks mainly focus on the evaluation of a model's capability to interpolate values inside a data manifold, where models good at performing local statistical smoothing are rewarded. However, there exists a very large category of high-value tabular data, including financial modeling and physical simulations, which are generated based upon deterministic computational processes, as opposed to stochastic and noisy relationships. Therefore, we investigate if tabular models can provide an extension from statistical interpolation to computational extrapolation. We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. We evaluate 9 tabular architectures and in-context learning (ICL) with GPT-OSS-120B. On standard regression metrics, TabPFN v2.5 performs remarkably well, achieving R^2=0.998 in-distribution and maintaining positive R^2 even under distribution shift, which is unique among the tabular models we tested. When we measure rounded consistency (exact integer match), a different picture emerges: TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%. This gap between R^2 and exact-match accuracy suggests that tabular models learn smooth function approximations but struggle to recover precise computational outputs under extrapolation. The two paradigms appear complementary: TabPFN scales efficiently with data; ICL achieves exact computation from few examples. We release all code and data to support further investigation.
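Abstract (reference translation): Standard tabular benchmarks focus mainly on evaluating a model's ability to interpolate values inside a data manifold, which rewards models that are good at local statistical smoothing. However, a very large category of high-value tabular data, including financial modeling and physical simulations, is generated by deterministic computational processes rather than stochastic, noisy relationships. We therefore investigate whether tabular models can extend from statistical interpolation to computational extrapolation. We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. We evaluate 9 tabular architectures and in-context learning (ICL) with GPT-OSS-120B. On standard regression metrics, TabPFN v2.5 achieves R^2=0.998 in-distribution and maintains a positive R^2 even under distribution shift. When rounded consistency (exact integer match) is measured, TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%. This gap between R^2 and exact-match accuracy suggests that tabular models learn smooth function approximations but struggle to recover precise computational outputs under extrapolation. TabPFN scales efficiently with data, while ICL achieves exact computation from few examples. We release all code and data to support further investigation.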
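
The gap between R^2 and exact-match accuracy described in the abstract can be illustrated with a small sketch. It assumes that "rounded consistency" means rounding each prediction to the nearest integer and checking equality with the integer target; that reading is inferred from the phrase "exact integer match" and is not the paper's published evaluation code. The target function and the prediction offset are hypothetical toy values, not data from TabularMath.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rounded_consistency(y_true, y_pred):
    """Fraction of predictions that round to exactly the integer target.

    Assumed definition, based on the abstract's "exact integer match".
    """
    hits = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    return hits / len(y_true)


# Hypothetical deterministic target: y = x^2 for x = 1..10.
y_true = [x * x for x in range(1, 11)]

# Hypothetical smooth approximation: every prediction is off by 0.7,
# which is small relative to the spread of the targets.
y_pred = [y + 0.7 for y in y_true]

r2 = r_squared(y_true, y_pred)
exact = rounded_consistency(y_true, y_pred)
print(f"R^2 = {r2:.5f}, rounded consistency = {exact:.0%}")
```

In this toy case R^2 stays above 0.999 while no prediction rounds to the correct integer, which is the kind of divergence between the two metrics that the abstract reports.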


Related papers