Fugu-MT Paper Translation (Abstract): DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models

Paper Abstract: DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models

arxiv url: http://arxiv.org/abs/2511.14813v1
Date: Tue, 18 Nov 2025 02:37:27 GMT
Status: Translation complete
System update date: 2025-11-20 15:51:28.474805
Title: DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models
Title (reference translation): DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models
Authors: Yifan Li, Qin Li, Min Zhang, Min Zhang, Peixin Wang
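Abstract summary: Human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern has not been comprehensively described or evaluated in large language models. The paper proposes a novel prompt engineering approach called Derivation Prompting (DP).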
Reference score (independently computed attention score): 25.206941240088877
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e. applying DR by making the corresponding modification to the output whenever the input takes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show that mainstream LLMs, such as GPT-4o and Claude3.5, exhibit moderate DR recognition capabilities but reveal significant drop-offs on applying DR effectively in problem-solving scenarios. To improve this, we propose a novel prompt engineering approach called Derivation Prompting (DP). It achieves an average improvement of 15.2% in DC for all tested LLMs, outperforming commonly used prompt engineering techniques.
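Abstract (reference translation): Evaluating the reasoning ability of Large Language Models (LLMs) over data is an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules governing the relationships between changes of data, has not been comprehensively described or evaluated in LLMs. The paper formally defines this reasoning pattern as the Derivation Relation (DR) and introduces the concept of Derivation Capability (DC), that is, applying DR by making the corresponding modification to the output whenever the input changes in certain ways. To assess DC, it proposes a systematically constructed evaluation framework named DEVAL and uses it to evaluate five popular LLMs and one Large Reasoning Model on seven mainstream tasks. The results show that mainstream LLMs such as GPT-4o and Claude3.5 exhibit moderate DR recognition capability but drop off significantly when applying DR effectively in problem-solving scenarios. To improve this, the paper proposes a novel prompt engineering approach called Derivation Prompting (DP), which achieves an average improvement of 15.2% in DC across all tested LLMs and outperforms commonly used prompt engineering techniques.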
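The idea of a derivation relation in the abstract can be illustrated with a toy sketch. This is not the DEVAL framework or any task from the paper; the summation task, the function names (`derived_output`, `dr_consistency`, `forgetful_sum`), and the test cases are all hypothetical and chosen only for illustration. The relation assumed here is: appending a value `x` to the input list should change the output sum by exactly `x`.

```python
from typing import Callable, List, Tuple


def derived_output(old_output: int, appended: int) -> int:
    """Assumed derivation rule for a toy summation task:
    appending `appended` to the input raises the sum by `appended`."""
    return old_output + appended


def dr_consistency(
    solver: Callable[[List[int]], int],
    cases: List[Tuple[List[int], int]],
) -> float:
    """Fraction of cases where the solver's answer on the changed input
    equals the answer derived from its answer on the original input."""
    hits = 0
    for xs, x in cases:
        answer_on_changed_input = solver(xs + [x])
        answer_derived_from_old = derived_output(solver(xs), x)
        if answer_on_changed_input == answer_derived_from_old:
            hits += 1
    return hits / len(cases)


def forgetful_sum(xs: List[int]) -> int:
    """Hypothetical flawed solver: ignores everything past the third element."""
    return sum(xs[:3])


# Hypothetical test cases: (original input, value appended to it).
cases = [([1, 2], 5), ([1, 2, 3], 4), ([], 7)]

score_exact = dr_consistency(sum, cases)
score_flawed = dr_consistency(forgetful_sum, cases)
```

With these cases, the built-in `sum` satisfies the relation every time, while `forgetful_sum` breaks it once the appended element falls outside the part of the input it reads. A solver can thus answer individual inputs plausibly and still fail to propagate an input change to the output.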

Related Papers