Fugu-MT Paper Translation (Abstract): Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Paper summary: Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

arxiv url: http://arxiv.org/abs/2606.07641v1
Date: Mon, 01 Jun 2026 14:16:09 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.175327
Title: Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models
Authors: Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng
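Abstract summary: Can vision-language models predict, from the original image alone, what a 180° rotation would reveal? The paper studies this ability through Rotated-Outcome Prediction. Current vision-language models can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.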
Reference score (attention score computed independently by this site): 82.10380665213867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
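The paired setup described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code or the RotOutBench implementation: the function names, the record field names, and the "reverted to original" measure are hypothetical and introduced only for illustration. It assumes each item has a ground-truth string for the original and the rotated view, plus three model outputs: a direct reading of the original image, a direct reading of the rotated image, and a prediction of the rotated reading made from the original image alone.

```python
from typing import Dict, List, Sequence

Grid = List[List[int]]


def rotate_180(grid: Sequence[Sequence[int]]) -> Grid:
    """Rotate a 2D pixel grid by 180 degrees in-plane.

    A 180-degree in-plane rotation reverses the row order and
    reverses the pixels within each row.
    """
    return [list(reversed(row)) for row in reversed(grid)]


def paired_accuracy(records: Sequence[Dict[str, str]]) -> Dict[str, float]:
    """Score the three conditions on the same items (hypothetical field names).

    direct_original:      model reads the original image.
    direct_rotated:       model reads the rotated image.
    predicted_rotation:   model sees only the original image and must
                          state what the rotated image would read.
    reverted_to_original: share of predictions that match the original
                          string instead of the rotated one.
    """
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)

    def rate(hits: int) -> float:
        return hits / n

    return {
        "direct_original": rate(
            sum(r["read_original"] == r["original_truth"] for r in records)
        ),
        "direct_rotated": rate(
            sum(r["read_rotated"] == r["rotated_truth"] for r in records)
        ),
        "predicted_rotation": rate(
            sum(r["predicted_rotated"] == r["rotated_truth"] for r in records)
        ),
        "reverted_to_original": rate(
            sum(r["predicted_rotated"] == r["original_truth"] for r in records)
        ),
    }
```

Under these assumptions, the pattern the abstract reports would appear as high `direct_original` and `direct_rotated` values alongside a low `predicted_rotation` value on the same items.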
