Fugu-MT Paper Translation (Abstract): When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

Paper abstract: When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

arxiv url: http://arxiv.org/abs/2509.21051v1
Date: Thu, 25 Sep 2025 12:01:32 GMT
Status: Translation complete
System update date: 2025-09-26 20:58:12.884427
Title: When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Title (reference translation): Instruction Multiplication: Measuring and Estimating LLM Capabilities for Following Multiple Instructions
Authors: Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Edison Marrese-Taylor, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
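Abstract summary: This paper introduces two specialized benchmarks for fundamental domains where following multiple instructions is important. It shows that performance consistently degrades as the number of instructions increases. It also demonstrates that a logistic regression model using instruction count as an explanatory variable can predict multiple-instruction-following performance with approximately 10% error.
Reference score (independently computed attention level): 42.08242599538887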
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
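Abstract (reference translation): As large language models (LLMs) are increasingly applied to real-world scenarios, it is important to understand their ability to follow multiple instructions simultaneously. To evaluate these capabilities systematically, we introduce two specialized benchmarks for fundamental domains where following multiple instructions matters: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Experiments with these benchmarks across ten LLMs show that performance consistently degrades as the number of instructions increases. Furthermore, because evaluating every possible combination of multiple instructions is computationally impractical in actual use, we developed three types of regression models that can estimate performance both on unseen instruction combinations and on instruction counts not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict multiple-instruction-following performance with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation.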
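The idea of predicting success from instruction count alone can be illustrated with a small sketch. The code below is not the authors' implementation: it fits a one-variable logistic regression by Newton's method on synthetic data, assuming a hypothetical 0.9 chance of following each individual instruction, 100 hypothetical trials per instruction count, and a hypothetical split in which odd counts are used for fitting and even counts are held out. All names (`fit_logistic`, `PER_INSTRUCTION_RATE`, `TRIALS`) are illustrative.

```python
import math

# Hypothetical settings for the synthetic data (not taken from the paper).
PER_INSTRUCTION_RATE = 0.9   # assumed chance of following one instruction
TRIALS = 100                 # assumed number of prompts per instruction count


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, successes, trials, iters=25):
    """Fit P(success) = sigmoid(b0 + b1 * x) on grouped binomial data
    using Newton's method on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, t in zip(xs, successes, trials):
            p = sigmoid(b0 + b1 * x)
            r = s - t * p            # residual: observed minus expected successes
            w = t * p * (1.0 - p)    # binomial variance weight
            g0 += r
            g1 += x * r
            h00 += w
            h01 += x * w
            h11 += x * x * w
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic "all instructions followed" counts for the training instruction counts.
train_counts = [1, 3, 5, 7, 9]
train_successes = [round(TRIALS * PER_INSTRUCTION_RATE ** n) for n in train_counts]
train_trials = [TRIALS] * len(train_counts)

intercept, slope = fit_logistic(train_counts, train_successes, train_trials)

# Predict success rates for instruction counts that were not used for fitting.
test_counts = [2, 4, 6, 8, 10]
preds = [sigmoid(intercept + slope * n) for n in test_counts]
truth = [PER_INSTRUCTION_RATE ** n for n in test_counts]
mae = sum(abs(p - t) for p, t in zip(preds, truth)) / len(test_counts)

for n, p, t in zip(test_counts, preds, truth):
    print(f"instructions={n:2d}  predicted={p:.3f}  synthetic_truth={t:.3f}")
print(f"mean absolute error on held-out counts: {mae:.3f}")
```

The fitted slope is negative because success becomes less likely as instructions are added, which mirrors the degradation described in the abstract. The error printed here reflects only the synthetic setup and says nothing about the figures reported in the paper.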


Related papers