Fugu-MT 論文翻訳(概要): CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

論文の概要: CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

arxiv url: http://arxiv.org/abs/2402.14809v1
Date: Thu, 22 Feb 2024 18:59:02 GMT
ステータス: 翻訳完了
システム内更新日: 2024-02-23 13:54:52.370959
Title: CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
Title（参考訳）: criticbench: 批判的正しい推論のためのllmベンチマーク
Authors: Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang
Abstract要約: CriticBenchは、大規模言語モデルの推論を批判し修正する能力を評価するために設計されたベンチマークである。生成, 批判, 修正推論における17個のLLMの性能を評価し, 評価した。
参考スコア（独自算出の注目度）: 28.028010138432776
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
Abstract（参考訳）: 大規模言語モデル(LLM)がそれらの推論を批判し、洗練する能力は、評価、フィードバックのプロビジョニング、自己改善において非常に重要である。本稿では,llms のさまざまなタスクに対する批判的・正当化能力を評価するための総合ベンチマークである criticbench について紹介する。 CriticBenchは数学、常識、記号、コーディング、アルゴリズムの5つの推論領域を含んでいる。 15のデータセットをコンパイルし、3つのLLMファミリーからのレスポンスを組み込む。批判ベンチを活用し,世代,批判,訂正推論,すなわちgqc推論における17llmの性能を評価し,分析する。 Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. LLMの微妙な批判的正しい推論に対するこれらの洞察が、LCM批判と自己改善のさらなる研究を促進することを願っている。

関連論文リスト

DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
我々は,Large Language Models(LLMs)の数学批判能力の研究と向上に焦点をあてる。 Qwen2.5-7B-Instructをベースとした批判モデルを開発した。
論文参考訳（メタデータ） (2025-05-01T17:03:17Z)
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
我々は,Large Language Models (LLMs) の批判能力を評価するために設計された新しいベンチマークを導入する。通常、オープンループ方式で機能する既存のベンチマークとは異なり、我々のアプローチでは、批判から生成された修正の質を評価するクローズドループ手法を採用している。
論文参考訳（メタデータ） (2025-01-24T13:48:10Z)
Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT(Self-evolving CRITic)は、批評能力の真の自己進化を可能にするフレームワークである。コントラストベースの自己批判によって生成される合成データのトレーニングによって自己改善する。最大で10.3%の改善が達成されている。
論文参考訳（メタデータ） (2025-01-10T05:51:52Z)
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
我々は,LVLMの細粒度評価と補正能力を広範囲に解析する最初のベンチマークであるVISCOを提案する。 VISCOは密度が高くきめ細かな批判を特徴とし、LVLMは各ステップの正しさを評価する必要がある。 LookBackは、批評と修正のパフォーマンスを最大13.5%改善する。
論文参考訳（メタデータ） (2024-12-03T05:04:49Z)
Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RMは、余分な監督なしに自己生成した批評を使って報酬モデルを改善するフレームワークである。実験の結果、Critic-RMは標準報酬モデルやLLM審査員と比較して報酬モデリングの精度を3.7%-7.3%改善していることがわかった。
論文参考訳（メタデータ） (2024-11-25T18:28:26Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
本稿では、推論と批判モデルの役割を分離する2人プレイヤパラダイムを提案する。まず、批判データを収集する自動化およびスケーラブルなフレームワークであるAutoMathCritiqueを提案する。テスト時間における難解なクエリに対するアクターのパフォーマンスを,批判モデルが一貫して改善することが実証された。
論文参考訳（メタデータ） (2024-11-25T17:11:54Z)
Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
MultiCritique パイプラインはマルチエージェントフィードバックを利用することで LLM の批判能力を向上させる。パイプラインは、単一のモデルではなく、複数のエージェントからの高品質な批評を集約する。我々の微調整された7Bモデルは、他の高度な7B-13Bオープンソースモデルを大きく上回っている。
論文参考訳（メタデータ） (2024-10-20T04:57:45Z)
Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoTは、LLMをSystem-2のような批判能力にプッシュするフレームワークである。人間のアノテーションを使わずにCoT推論パラダイムと遠隔スーパービジョンデータの自動構築 GSM8KとMATHの実験は、我々の強化されたモデルがタスク解決性能を大幅に向上させることを示した。
論文参考訳（メタデータ） (2024-08-29T08:02:09Z)
CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEvalは、大規模言語モデルの批判能力を包括的かつ確実に評価するように設計された、新しいベンチマークである。包括性を確保するため、CriticalEvalは9つの異なるタスクシナリオの4次元から批判能力を評価する。信頼性を確保するため、多数の批判が注釈付けされ、参照として機能する。
論文参考訳（メタデータ） (2024-02-21T12:38:59Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
本稿では,新しいタスク,CLOMO(Counterfactual Logical Modification)と高品質な人間アノテーションベンチマークを紹介する。このタスクでは、LLMは所定の論理的関係を維持するために、与えられた議論的テキストを順応的に変更しなければなりません。 LLMの自然言語出力を直接評価する革新的な評価指標である自己評価スコア(SES)を提案する。
論文参考訳（メタデータ） (2023-11-29T08:29:54Z)
Critique Ability of Large Language Models [38.34144195927209]
本研究では,大規模言語モデル(LLM)が様々なタスクに対して正確な批評を提供する能力について検討する。我々は,高品質な自然言語クエリとそれに対応するモデル応答からなるCriticBenchというベンチマークを開発した。
論文参考訳（メタデータ） (2023-10-07T14:12:15Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。