Fugu-MT Paper Translation (Abstract): RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Paper summary: RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

arxiv url: http://arxiv.org/abs/2508.13968v1
Date: Tue, 19 Aug 2025 15:58:25 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:32.00586
Title: RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Title (reference translation): RotBench: Evaluating Multimodal Large Language Models by Identifying Image Rotation
Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
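Abstract summary: The paper investigates to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated by 0°, 90°, 180°, and 270°. The task requires robust visual reasoning to detect rotational cues and contextualize spatial relationships within an image, regardless of its orientation. Several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, are shown not to reliably identify the rotation of input images.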
Reference score (independently computed attention score): 59.830657530592255
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
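Abstract (reference translation): We examine the extent to which Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated by 0°, 90°, 180°, and 270°. The task requires robust visual reasoning to detect rotational cues and contextualize spatial relationships within an image, regardless of its orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a manually filtered 350-image benchmark consisting of lifestyle, portrait, and landscape images. Although the task is relatively simple, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify the rotation of input images. Supplying models with auxiliary information such as captions and depth maps, or using chain-of-thought prompting, yields only small and inconsistent improvements. The results indicate that most models reliably identify right-side-up (0°) images, while certain models can identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Showing the image in several orientations at once gives moderate gains for reasoning models, while a modified setup that uses voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° from 270° rotations, even though it substantially improves the identification of 180° images. Together, these results reveal a significant gap between the spatial reasoning capabilities of MLLMs and human perception in identifying rotation.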
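
The evaluation described in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the authors' code: images are stood in for by small 2D grids, `predict` is a hypothetical callable that returns a guessed angle, and the names `rotate_ccw`, `per_angle_accuracy`, and `vote_prediction` are invented here. The counter-clockwise rotation convention and the way votes are normalized are assumptions, since the abstract does not specify them.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)


def rotate_ccw(grid, angle):
    """Rotate a 2D grid (list of rows) counter-clockwise by a multiple of 90 degrees.

    Assumption: counter-clockwise is an arbitrary convention chosen for this sketch.
    """
    if angle not in ANGLES:
        raise ValueError(f"angle must be one of {ANGLES}, got {angle}")
    out = [list(row) for row in grid]
    for _ in range(angle // 90):
        # One 90-degree counter-clockwise turn: transpose, then reverse the row order.
        out = [list(col) for col in zip(*out)][::-1]
    return out


def per_angle_accuracy(images, predict):
    """Fraction of images whose rotation `predict` identifies correctly, per angle."""
    correct = {angle: 0 for angle in ANGLES}
    for image in images:
        for angle in ANGLES:
            if predict(rotate_ccw(image, angle)) == angle:
                correct[angle] += 1
    return {angle: correct[angle] / len(images) for angle in ANGLES}


def vote_prediction(image, predict):
    """Majority vote over four extra rotations of the same input (assumed scheme).

    Each guess is mapped back to the frame of the original input by
    subtracting the extra rotation that was applied before asking.
    """
    votes = Counter()
    for extra in ANGLES:
        guess = predict(rotate_ccw(image, extra))
        votes[(guess - extra) % 360] += 1
    return votes.most_common(1)[0][0]
```

Reporting accuracy separately for each angle, rather than as a single average, is what makes a pattern like the one in the abstract visible: a predictor that always answers 0° scores perfectly on upright images and fails on every rotated one.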

Related papers