Fugu-MT paper translation (abstract): Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

Paper summary: Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

arxiv url: http://arxiv.org/abs/2605.10810v2
Date: Fri, 15 May 2026 15:01:38 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:25.946966
Title: Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Title (reference translation): Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Authors: Daniel Ranard
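Abstract summary: We introduce an automatically generated benchmark for predicting hidden text in technical papers. A separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control.
Reference score (independently computed attention score): 0.0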
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.
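Abstract (reference translation): We automatically generate a benchmark for predicting hidden text in technical papers. The evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation. Our main testbed is equation-suffix prediction: the predictor sees the context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.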
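
The comparison described in the abstract (scoring $Y$ with a forecast $Z$ versus with a context-only control) can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy log-probabilities, and in particular the clipping rule (a per-token floor of -10 on the log-probability) are assumptions made for illustration, since the abstract does not define how the likelihood is clipped. In practice the per-token log-probabilities would come from a scorer model.

```python
from typing import Sequence


def clipped_mean_logprob(token_logprobs: Sequence[float], floor: float = -10.0) -> float:
    """Mean per-token log-probability of Y, with each token clipped below at `floor`.

    ASSUMPTION: the per-token floor is a hypothetical clipping rule,
    not one taken from the paper.
    """
    if not token_logprobs:
        raise ValueError("need at least one target token")
    return sum(max(lp, floor) for lp in token_logprobs) / len(token_logprobs)


def lift_over_control(
    logprobs_with_forecast: Sequence[float],
    logprobs_with_control: Sequence[float],
    floor: float = -10.0,
) -> float:
    """Clipped score of Y given the forecast Z, minus the score given the control Z."""
    return clipped_mean_logprob(logprobs_with_forecast, floor) - clipped_mean_logprob(
        logprobs_with_control, floor
    )


# Toy per-token log-probabilities of the hidden continuation Y (hypothetical values).
with_forecast = [-1.0, -3.0, -1.0]   # scorer sees X and the forecast Z
with_control = [-2.0, -15.0, -1.0]   # scorer sees X and recent context in place of Z

print(lift_over_control(with_forecast, with_control))
```

A positive value means the forecast made the hidden continuation more likely under the scorer than the control did. Because any single suffix is one of many roughly equivalent continuations, such values would be averaged over many items rather than read one at a time.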

Related papers