Fugu-MT Paper Translation (Abstract): VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Paper overview: VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

arxiv url: http://arxiv.org/abs/2511.02778v1
Date: Tue, 04 Nov 2025 18:00:18 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:06.13535
Title: VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
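Title (reference translation): VCode: A Multimodal Coding Benchmark Using SVG as a Symbolic Visual Representation
Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
Abstract summary: We advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench).
Reference score (independently computed attention score): 51.95090758710288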
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
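Abstract (reference translation): Code has emerged as a precise, executable medium for reasoning and action in the agent era. However, progress has concentrated on language-centric tasks such as program synthesis and debugging, and visual-centric coding remains underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. VCode is a benchmark that reframes multimodal understanding as code generation: given an image, a model must generate SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To evaluate symbolic fidelity, we propose CodeVQA, a new evaluation protocol in which a policy model answers questions about rendered SVGs; correct answers indicate that symbolic meaning was faithfully preserved. Empirically, frontier VLMs struggle to generate faithful SVGs, exposing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines the SVG code; and (ii) Acting with Visual Tools, in which detectors and parsers supply structured cues such as objects, shapes, and text that lie beyond the model's intrinsic capacity. Across the benchmarks, frontier VLMs with strong reasoning capabilities score well overall but remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.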
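
The CodeVQA protocol described in the abstract (image to SVG, SVG rendered back to an image, a policy model answering questions over the rendering) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `codevqa_accuracy`, the three callables, and the case-insensitive exact-match scoring are hypothetical and are not taken from the VCode repository, which may judge answers differently.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type alias: an image is treated as raw bytes in this sketch.
Image = bytes


def codevqa_accuracy(
    samples: Iterable[Tuple[Image, str, str]],
    image_to_svg: Callable[[Image], str],        # assumed: the VLM under test
    render_svg: Callable[[str], Image],          # assumed: any SVG rasterizer
    policy_answer: Callable[[Image, str], str],  # assumed: the policy model
) -> float:
    """Fraction of questions answered correctly from the rendered SVG alone."""
    total = 0
    correct = 0
    for image, question, reference in samples:
        svg_code = image_to_svg(image)       # the model writes SVG for the image
        rendered = render_svg(svg_code)      # the SVG is rendered back to pixels
        prediction = policy_answer(rendered, question)
        # Assumption: case-insensitive exact match stands in for the real judge.
        correct += int(prediction.strip().lower() == reference.strip().lower())
        total += 1
    return correct / total if total else 0.0
```

Under this reading, the policy model never sees the original image, so a high score suggests that the SVG kept the information needed to answer the question.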

Related papers