Fugu-MT Paper Translation (Abstract): The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Paper overview: The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

arxiv url: http://arxiv.org/abs/2506.20803v1
Date: Wed, 25 Jun 2025 19:47:23 GMT
Status: Translation complete
Last updated in system: 2025-06-27 19:53:09.871337
Title: The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Title (reference translation): The Ideation-Execution Gap: Execution Outcomes of LLM-Generated and Human Research Ideas
Authors: Chenglei Si, Tatsunori Hashimoto, Diyi Yang
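Abstract summary: A good idea should not merely be novel; it should also lead to better research once executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than those of expert-written ideas.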
Reference score (independently calculated attention level): 90.26363107905344
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
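Abstract (reference translation): Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged more novel than human-expert ideas. However, a good idea should not merely appear novel; it should also lead to better research once executed. To test whether AI-generated ideas lead to better research outcomes, we recruit 43 expert researchers to execute randomly assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper documenting the experiments. All executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than those of expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05). Comparing the aggregated review scores from the execution study, we observe that on many metrics the rankings flip, with human ideas scoring higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.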
