Fugu-MT paper translation (abstract): Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Paper summary: Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

arxiv url: http://arxiv.org/abs/2606.15932v2
Date: Tue, 16 Jun 2026 15:28:03 GMT
Status: Translation complete
Last updated in system: 2026-06-17 15:01:46.812961
Title: Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence
Title (reference translation): Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence
Authors: Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng
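Abstract summary: This survey examines systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. It first formulates the field by the role that code plays in each task, then organizes benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks.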
Reference score (independently computed attention score): 31.954261925882452
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub (https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code).
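Abstract (reference translation): Large Language Models (LLMs) have substantially advanced text-to-code synthesis, but many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. Such tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. It first formulates the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. It then organizes benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and makes it possible to compare how different tasks treat evidence of correctness. We argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move the field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub (https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code).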
