Fugu-MT paper translation (abstract): SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Paper overview: SketchVLM: Vision language models can annotate images to explain thoughts and guide users

arxiv url: http://arxiv.org/abs/2604.22875v2
Date: Tue, 28 Apr 2026 04:48:22 GMT
Status: translation complete
Last updated in system: 2026-04-29 14:06:43.823214
Title: SketchVLM: Vision language models can annotate images to explain thoughts and guide users
Title (reference translation): SketchVLM: Vision language models annotate images, explain their thoughts, and guide users
Authors: Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen
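Abstract summary: SketchVLM is a training-free, model-agnostic framework that enables vision-language models to produce non-destructive, editable overlays on the input image to visually explain their answers. Single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration.
Reference score (independently computed attention score): 16.25722200375932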
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
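Abstract (reference translation): When people answer questions about images, they naturally point, label, and draw to explain their reasoning. By contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 respond only with text, which users can find difficult to verify. SketchVLM is a training-free, model-agnostic framework that lets VLMs produce non-destructive, editable SVG overlays on the input image to explain their answers visually. On seven benchmarks covering visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, and its annotations are more faithful to the model's stated answer. Single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.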
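
Illustrative sketch (not from the paper): the abstract describes annotations as non-destructive, editable SVG overlays on the input image. The Python snippet below shows one way such an overlay could be represented, assuming the image is only referenced by an SVG `<image>` element and all annotations sit in a separate `<g id="overlay">` group. The function name `build_overlay_svg`, the annotation dictionary format, and the file name `maze.png` are hypothetical and are not taken from the SketchVLM code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def _tag(name):
    # ElementTree addresses namespaced tags as "{namespace}name".
    return f"{{{SVG_NS}}}{name}"


def build_overlay_svg(image_href, width, height, annotations):
    """Return an SVG string that layers annotations over a referenced image.

    The image pixels are never modified: the image is only referenced, and
    every annotation lives in a separate <g id="overlay"> group that can be
    edited or deleted later. (Hypothetical helper, for illustration only.)
    """
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns
    size = {"width": str(width), "height": str(height)}
    svg = ET.Element(_tag("svg"), {**size, "viewBox": f"0 0 {width} {height}"})
    ET.SubElement(svg, _tag("image"), {**size, "href": image_href})
    overlay = ET.SubElement(
        svg,
        _tag("g"),
        {"id": "overlay", "fill": "none", "stroke": "red", "stroke-width": "3"},
    )
    for ann in annotations:
        kind = ann["type"]
        if kind == "circle":
            attrs = {key: str(ann[key]) for key in ("cx", "cy", "r")}
            ET.SubElement(overlay, _tag("circle"), attrs)
        elif kind == "polyline":
            points = " ".join(f"{x},{y}" for x, y in ann["points"])
            ET.SubElement(overlay, _tag("polyline"), {"points": points})
        elif kind == "label":
            attrs = {"x": str(ann["x"]), "y": str(ann["y"]),
                     "fill": "red", "stroke": "none"}
            text = ET.SubElement(overlay, _tag("text"), attrs)
            text.text = ann["text"]  # ElementTree escapes this on output
        else:
            raise ValueError(f"unknown annotation type: {kind}")
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    print(build_overlay_svg(
        "maze.png", 200, 100,
        [
            {"type": "polyline", "points": [(10, 10), (60, 10), (60, 80)]},
            {"type": "circle", "cx": 60, "cy": 80, "r": 8},
            {"type": "label", "x": 70, "y": 85, "text": "exit"},
        ],
    ))
```

Because the overlay is a separate group, removing the `<g id="overlay">` element restores the unannotated view, and individual shapes can be adjusted without regenerating the image.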

Related papers