Fugu-MT Paper Translation (Summary): CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Paper summary: CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

arxiv url: http://arxiv.org/abs/2603.07886v1
Date: Mon, 09 Mar 2026 01:49:19 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.33725
Title: CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Title (reference translation): CCR-Bench: A Comprehensive Benchmark for LLM Evaluation on Complex Constraints, Control Flows, and Real-World Cases
Authors: Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng
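Abstract summary: CCR-Bench is a new benchmark designed to evaluate how well large language models adhere to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived from real-world industrial scenarios.
Reference score (independently computed attention score): 40.58765467531474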
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
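Abstract (reference translation): Strengthening the ability of large language models (LLMs) to follow complex instructions is important for deploying them in real-world applications. However, existing evaluation methods simplify instruction complexity as merely an additive combination of atomic constraints, and cannot adequately capture the high-dimensional complexity that arises from the intricate interplay of content and format, logical workflow control, and real-world applications. This creates a large gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a new benchmark designed to evaluate LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived from real-world industrial scenarios. Extensive experiments on CCR-Bench show that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench provides a more rigorous and realistic evaluation framework and advances the development of LLMs toward next-generation models capable of understanding and executing complex tasks in industrial applications.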


Related papers