Fugu-MT Paper Translation (Abstract): An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Paper abstract: An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

arxiv url: http://arxiv.org/abs/2603.09701v1
Date: Tue, 10 Mar 2026 14:12:18 GMT
Status: Translation complete
System update date: 2026-03-11 15:25:24.369791
Title: An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation
Title (reference translation): An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation
Authors: Binquan Zhang, Li Zhang, Lin Shi, Song Wang, Yuwei Qian, Linhui Zhao, Fang Liu, An Fu, Yida Ye
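Abstract summary: Large language models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces. LLMs are very good at generating standalone code snippets, but they struggle to maintain contextual consistency during extended interactions. Existing benchmarks emphasize the functional correctness of the final output and overlook latent quality issues within the interaction process itself, which the authors call Interaction Smells.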
Reference score (independently computed attention score): 10.568269273364448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that facilitate complex, multi-turn collaborative programming. While LLMs exhibit remarkable proficiency in generating standalone code snippets, they often struggle to maintain contextual consistency during extended interactions, creating significant obstacles in the collaboration process. Existing benchmarks primarily emphasize the functional correctness of the final output, overlooking latent quality issues within the interaction process itself, which we term Interaction Smells. In this paper, we conduct an empirical study on sampled real-world user-LLM interactions from WildChat and LMSYS-Chat-1M datasets to systematically investigate Interaction Smells in human-LLM code generation tasks from the perspectives of phenomena, distribution, and mitigation. First, we establish the first taxonomy of Interaction Smells by manually performing open card sorting on real-world interaction logs. This taxonomy categorizes Interaction Smells into three primary categories, i.e., User Intent Quality, Historical Instruction Compliance, and Historical Response Violation, comprising nine specific subcategories. Next, we quantitatively evaluate six mainstream LLMs (i.e., GPT-4o, DeepSeek-Chat, Gemini 2.5, Qwen2.5-32B, Qwen2.5-72B, and Qwen3-235B-a22b) to analyze the distribution of Interaction Smells across different models. Finally, we propose Invariant-aware Constraint Evolution (InCE), a multi-agent framework designed to improve multi-turn interaction quality through explicit extraction of global invariants and pre-generation quality audits. Experimental results on the extended WildBench benchmark demonstrate that this lightweight mitigation approach significantly improves the Task Success Rate and effectively suppresses the occurrence of Interaction Smells.
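Abstract (reference translation): Large language models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that support complex, multi-turn collaborative programming. LLMs show remarkable proficiency at generating standalone code snippets, but they struggle to maintain contextual consistency over extended interactions, which creates serious obstacles in the collaboration process. Existing benchmarks emphasize the functional correctness of the final output and overlook latent quality issues within the interaction process itself, which the authors call Interaction Smells. This paper presents an empirical study on sampled real-world user-LLM interactions from the WildChat and LMSYS-Chat-1M datasets, systematically examining Interaction Smells in human-LLM code generation tasks from the perspectives of phenomena, distribution, and mitigation. First, the authors establish the first taxonomy of Interaction Smells by manually performing open card sorting on real-world interaction logs. The taxonomy groups Interaction Smells into three primary categories (User Intent Quality, Historical Instruction Compliance, and Historical Response Violation) comprising nine specific subcategories. Next, they quantitatively evaluate six mainstream LLMs (GPT-4o, DeepSeek-Chat, Gemini 2.5, Qwen2.5-32B, Qwen2.5-72B, and Qwen3-235B-a22b) to analyze how Interaction Smells are distributed across models. Finally, they propose Invariant-aware Constraint Evolution (InCE), a multi-agent framework that aims to improve multi-turn interaction quality through explicit extraction of global invariants and pre-generation quality audits. Experiments on the extended WildBench benchmark show that this lightweight mitigation approach significantly improves the Task Success Rate and effectively suppresses the occurrence of Interaction Smells.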

Related papers