Fugu-MT Paper Translation (Abstract): TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F

Paper abstract: TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F

arxiv url: http://arxiv.org/abs/2509.23686v1
Date: Sun, 28 Sep 2025 06:57:42 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.377367
Title: TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F
Title (reference translation): TF-Bench: Evaluating Program Semantics via Type Inference in System F
Authors: Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen
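Abstract summary: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. This paper introduces TF-Bench, a benchmark that evaluates LLM reasoning based on type inference in System F. TF-Bench_pure is a purely semantics-driven variant of TF-Bench.
Reference score (independently computed attention score): 5.6064011695311455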
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
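Abstract (reference translation): Large language models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show great potential for understanding program logic and semantics beyond mere token recognition. However, current code-reasoning benchmarks lack a formal, program-centric deductive framework that would ensure sound evaluation, and they cannot assess whether models truly reason about program semantics or merely exploit superficial associations between natural language and code tokens. To fill this gap, we introduce TF-Bench, a benchmark for evaluating LLM reasoning based on type inference in System F. Using verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. The best-performing LLM (Claude-3.7-sonnet) achieves only 55.85% accuracy on TF-Bench_pure. In addition, we propose two new metrics for assessing robustness and the effectiveness of test-time reasoning, underscoring the limitations of current LLM capabilities and highlighting essential directions for future research.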
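Illustration (not taken from the paper or the benchmark): the Haskell sketch below shows what a type-inference task of the kind the abstract describes could look like. It assumes Haskell as a stand-in for System F, where a polymorphic signature corresponds to a System F type with implicit universal quantifiers. The names `compose`, `f1`, and `v1` to `v3` are hypothetical, and the renaming to opaque identifiers is only one assumed reading of "removing semantically irrelevant natural language".

```haskell
module Main where

-- Hypothetical task item: given only the definition on the second line,
-- the goal would be to produce the signature on the first line.
-- In System F notation the type reads: forall a b c. (b -> c) -> (a -> b) -> a -> c
compose :: (b -> c) -> (a -> b) -> a -> c
compose f g x = f (g x)

-- The same definition with opaque names (an assumed analogue of a
-- "pure" variant): the name no longer hints at the type, so the
-- signature must be derived from the structure of the body alone.
f1 :: (t2 -> t3) -> (t1 -> t2) -> t1 -> t3
f1 v1 v2 v3 = v1 (v2 v3)

main :: IO ()
main = do
  -- Both definitions behave identically, since only the names differ.
  print (compose (+ 1) (* 2) (5 :: Int))
  print (f1 (+ 1) (* 2) (5 :: Int))
```

Both definitions type-check with the same signature up to renaming of type variables, which is the property a names-free variant of such a task would rely on.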


List of related papers