Fugu-MT Paper Translation (Abstract): The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

Paper overview: The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

arxiv url: http://arxiv.org/abs/2606.01901v1
Date: Mon, 01 Jun 2026 08:42:06 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.624412
Title: The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
Authors: Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen
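Abstract summary: We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns. The describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts.
Reference score (independently computed attention score): 15.768100289136392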
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.
