Fugu-MT Paper Translation (Abstract): Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Paper overview: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

arxiv url: http://arxiv.org/abs/2511.20613v1
Date: Tue, 25 Nov 2025 18:40:22 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.619986
Title: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
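Title (reference translation): Can Vibe Coding Beat CS Graduate Students? An LLM vs. Human Coding Tournament on Market-Driven Strategic Planning
Authors: Panayiotis Danassis, Naman Goel
Abstract summary: Large Language Models (LLMs) have revolutionized AI-assisted code generation. Their rapid development has outpaced our ability to benchmark them properly. This paper proposes a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem.
Reference score (independently computed attention score): 5.003221934057385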
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
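Abstract (reference translation): The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. The rapid development of LLMs has outpaced our ability to benchmark them properly. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. This paper proposes a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Results over 12 double all-play-all tournaments and $\sim 40$k matches show that (i) human-coded agents (written by graduate students) are clearly superior: the top 5 spots are consistently won by human-coded agents; (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines; and (iii) when given the best human solution as input and prompted to improve it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.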
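
As a rough sanity check on the match count quoted in the abstract, the following minimal Python sketch counts the matches in a double all-play-all schedule. It rests on assumptions not stated in the abstract: that "double all-play-all" means every ordered pair of agents meets once per tournament (so each unordered pair plays twice), and that only the 40 LLM-coded and 17 human-coded agents take part (the simple baselines mentioned in the abstract are left out). The agent labels and the function name `double_round_robin` are hypothetical.

```python
from itertools import permutations


def double_round_robin(agents):
    """Return every ordered pair (a, b) with a != b.

    Each unordered pair appears twice, once per ordering, which is the
    assumed meaning of "double all-play-all" here.
    """
    return list(permutations(agents, 2))


# Hypothetical labels; only the counts (40 and 17) come from the abstract.
agents = [f"llm_{i}" for i in range(40)] + [f"human_{i}" for i in range(17)]

schedule = double_round_robin(agents)
matches_per_tournament = len(schedule)          # 57 * 56 ordered pairs
total_matches = 12 * matches_per_tournament     # 12 tournaments, per the abstract

print(matches_per_tournament, total_matches)    # prints: 3192 38304
```

Under these assumptions the total is 38,304 matches, which is in line with the roughly 40k reported; including baseline agents in the field would raise the count.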

Related papers