Fugu-MT Paper Translation (Abstract): DeonticBench: A Benchmark for Reasoning over Rules

Paper overview: DeonticBench: A Benchmark for Reasoning over Rules

arxiv url: http://arxiv.org/abs/2604.04443v1
Date: Mon, 06 Apr 2026 05:41:02 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.104524
Title: DeonticBench: A Benchmark for Reasoning over Rules
Title (reference translation): DeonticBench: A Benchmark for Reasoning over Rules
Authors: Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme
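Abstract summary: DEONTICBENCH is a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. It is intended for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
Reference score (independently computed attention score): 52.69517904415795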
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
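Abstract (reference translation): Reasoning with complex, context-specific rules remains difficult for large language models (LLMs). In legal and policy settings, this appears as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. Recent benchmarks emphasize short-context mathematical reasoning, while fewer focus on long-context deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in several ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, yielding a formal problem interpretation and an explicit program trace. We release reference Prolog programs for all instances. The best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Training improves the quality of Prolog generation, but current RL methods still cannot solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.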
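
The abstract reports Housing results as macro-F1. As a reading aid, here is a minimal Python sketch of how macro-F1 is conventionally computed: the unweighted mean of per-class F1 scores. This is not the paper's evaluation code. The function name `macro_f1` and the labels `"permitted"` and `"prohibited"` are hypothetical, and the benchmark's actual label set and scoring script are not given in the abstract.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Return the unweighted mean of per-class F1 over all labels seen.

    Illustrative only: the labels and data below are hypothetical and
    do not come from DEONTICBENCH.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical deontic labels, for illustration only.
    gold = ["permitted", "permitted", "prohibited", "prohibited"]
    pred = ["permitted", "prohibited", "prohibited", "prohibited"]
    print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally regardless of its frequency, macro-F1 penalizes a model that does well only on the majority label, which is one plausible reason to prefer it over accuracy on imbalanced legal outcomes.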
