Fugu-MT paper translation (abstract): Wisdom and Delusion of LLM Ensembles for Code Generation and Repair

Paper overview: Wisdom and Delusion of LLM Ensembles for Code Generation and Repair

arxiv url: http://arxiv.org/abs/2510.21513v1
Date: Fri, 24 Oct 2025 14:39:23 GMT
Status: translation complete
Last updated in system: 2025-10-27 15:45:42.336972
Title: Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Title (reference translation): Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Authors: Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
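Abstract summary: The study compares ten large language models and three ensembles of these LLMs across three software engineering benchmarks. The theoretical upper bound for an ensemble's performance can be 83% above the best single model. A diversity-based strategy realizes up to 95% of this theoretical potential and remains effective even in small two-model ensembles.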
Reference score (independently computed attention score): 45.969630994412846
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Today's pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upper bound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
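Abstract (reference translation): Today's pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, in which different models contribute unique strengths. However, the degree to which coding LLMs complement one another, and the best strategy for maximizing an ensemble's potential, remain unclear, leaving practitioners without a clear path beyond single-model systems. To address this gap, we empirically compare ten LLMs from five families, and three ensembles of these LLMs, on three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics for identifying correct solutions in an ensemble's candidate pool. The theoretical upper bound for an ensemble's performance turns out to be as much as 83% above the best single model. The results indicate that consensus-based strategies for selecting solutions fall into a "popularity trap" that amplifies common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential and proves effective even in small two-model ensembles, enabling a cost-efficient performance improvement by leveraging multiple LLMs.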
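The relationship between the best single model, the theoretical upper bound, and consensus-based selection can be illustrated with a small sketch. It is not the authors' method: the abstract does not give exact definitions, so the code assumes that the upper bound is an oracle that counts a problem as solved when at least one model produces a passing candidate, and that consensus selection is a plurality vote over identical outputs. All model names, problem IDs, outputs, and pass flags are invented toy data. The diversity-based strategy is left out because the abstract does not describe how it works.

```python
from collections import Counter

# Hypothetical toy data: results[model][problem] = (candidate_output, passes_tests)
results = {
    "model_a": {"p1": ("x", True), "p2": ("y", False), "p3": ("z", False), "p4": ("w", False)},
    "model_b": {"p1": ("x", True), "p2": ("y", False), "p3": ("q", True), "p4": ("w", False)},
    "model_c": {"p1": ("v", False), "p2": ("u", True), "p3": ("z", False), "p4": ("w", False)},
}
problems = ["p1", "p2", "p3", "p4"]


def solved_by(model):
    """Problems for which this model's candidate passes the tests."""
    return {p for p in problems if results[model][p][1]}


def oracle_solved():
    """Assumed upper bound: a problem counts if any model solves it."""
    return set().union(*(solved_by(m) for m in results))


def majority_vote_solved():
    """Assumed consensus heuristic: pick the most frequent output per problem."""
    solved = set()
    for p in problems:
        candidates = [results[m][p] for m in results]
        counts = Counter(output for output, _ in candidates)
        winner, _ = counts.most_common(1)[0]
        # Identical outputs share the same pass/fail status in this toy data.
        if next(passed for output, passed in candidates if output == winner):
            solved.add(p)
    return solved


best_single = max(len(solved_by(m)) for m in results)
upper_bound = len(oracle_solved())
relative_gain = (upper_bound - best_single) / best_single

print(f"best single model solves {best_single} problems")
print(f"oracle upper bound solves {upper_bound} problems ({relative_gain:.0%} above best single)")
print(f"majority vote solves {len(majority_vote_solved())} problems")
```

In this toy data the plurality vote selects the shared incorrect outputs for `p2` and `p3`, so it solves fewer problems than the best single model even though a correct candidate exists in the pool. That is the kind of behavior the abstract calls the "popularity trap."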
