RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
- URL: http://arxiv.org/abs/2508.13968v2
- Date: Wed, 20 Aug 2025 17:53:09 GMT
- Title: RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
- Authors: Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
- Abstract summary: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. We show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
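The abstract's voting setup can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's released code: `predict` stands in for an MLLM query that returns the rotation it believes a view has, `rotate` for any image-rotation routine, and the vote is taken over the original rotation that each answer implies.

```python
from collections import Counter

ROTATIONS = (0, 90, 180, 270)

def vote_rotation(image, rotate, predict):
    """Show the model all four orientations of `image` and vote over
    the original rotation implied by each answer."""
    votes = []
    for extra in ROTATIONS:
        guess = predict(rotate(image, extra))  # model's guess for this view
        votes.append((guess - extra) % 360)    # implied original rotation
    return Counter(votes).most_common(1)[0][0]

# Toy demo: an "image" is just its true rotation in degrees, and the mock
# model mirrors the reported failure mode -- correct on 0/180-degree views
# but systematically answering 270 for 90-degree views.
toy_rotate = lambda img, d: (img + d) % 360
mock_mllm = lambda view: 270 if view == 90 else view

print(vote_rotation(90, toy_rotate, mock_mllm))  # -> 90 despite the confusion
```

Because the four views of a 90°-rotated image include a 180° view the mock model answers correctly, the majority vote can recover the true rotation even when single-view queries fail.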
Related papers
- Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? [66.88619941063048]
We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
arXiv Detail & Related papers (2025-05-17T08:48:40Z)
- Spectral State Space Model for Rotation-Invariant Visual Representation Learning
State Space Models (SSMs) have emerged as an alternative to Vision Transformers (ViTs). However, SSMs fail to identify relationships between conceptually related yet non-adjacent patches, and current vision-based SSMs are highly sensitive to transformations such as rotation. We introduce Spectral VMamba, a novel approach that effectively captures the global structure within an image.
arXiv Detail & Related papers (2025-03-09T00:37:43Z)
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
- GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting [49.32327147931905]
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussians from 2-4 posed sparse images in 0.23 seconds on a single A100 GPU.
Our model features a very simple transformer-based architecture; we patchify input posed images, pass the primitive multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering.
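The token bookkeeping implied by that patchify-then-decode description can be sketched as a back-of-the-envelope helper (illustrative sizes only; this is not GS-LRM's actual code, and the parameter counts per Gaussian are an assumption):

```python
def gslrm_shapes(h, w, patch, n_views):
    """Sequence length and output size for a patchify-then-decode pipeline:
    multi-view images are split into patch tokens, run through a
    transformer, then decoded back to per-pixel Gaussian parameters."""
    assert h % patch == 0 and w % patch == 0, "image must tile into patches"
    tokens_per_view = (h // patch) * (w // patch)
    seq_len = n_views * tokens_per_view  # transformer input length
    out_pixels = n_views * h * w         # one Gaussian decoded per pixel
    return seq_len, out_pixels

print(gslrm_shapes(256, 256, 8, 4))  # -> (4096, 262144)
```

The sequence length grows quadratically with resolution and linearly with the number of views, which is why patch size is the main lever on transformer cost.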
arXiv Detail & Related papers (2024-04-30T16:47:46Z) - Steerers: A framework for rotation equivariant keypoint descriptors [26.31402935889126]
Keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction.
We learn a linear transform in description space that encodes rotations of the input image.
We obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360.
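A minimal instance of such a "steerer" (illustrative only; real descriptors live in much higher dimensions): on a 2-D subspace of description space, a 90° input rotation can be encoded by a rotation matrix, and composing it four times recovers the identity, just as four 90° rotations recover the original image.

```python
def matmul2(a, b):
    """2x2 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

S = [[0.0, -1.0],
     [1.0, 0.0]]  # steerer encoding a 90-degree input rotation

m = [[1.0, 0.0], [0.0, 1.0]]  # identity
for _ in range(4):
    m = matmul2(S, m)         # steer the description four times

print(m)  # -> [[1.0, 0.0], [0.0, 1.0]]
```

The key property is that matching can then be done between descriptions of differently rotated images by applying S in feature space instead of re-extracting features.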
arXiv Detail & Related papers (2023-12-04T18:59:44Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718]
We present the Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem.
In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images.
The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
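The core operation can be sketched with a toy kernel rotation (nearest-neighbor resampling; the paper predicts the angle adaptively per image with a learned routing mechanism, while here the angle is simply given):

```python
import math

def rotate_kernel(kernel, theta):
    """Rotate a square conv kernel by `theta` radians about its center,
    sampling the source with nearest-neighbor interpolation."""
    n = len(kernel)
    c = (n - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Inverse-rotate the target coordinate to find the source cell.
            xs = cos_t * (x - c) + sin_t * (y - c) + c
            ys = -sin_t * (x - c) + cos_t * (y - c) + c
            xi, yi = round(xs), round(ys)
            if 0 <= xi < n and 0 <= yi < n:
                out[y][x] = kernel[yi][xi]
    return out

k = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]                 # weight at the top-middle cell
print(rotate_kernel(k, math.pi / 2))  # weight moves to the right-middle cell
```

Real implementations use bilinear sampling so the rotation is differentiable with respect to the predicted angle; nearest-neighbor keeps the sketch short.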
arXiv Detail & Related papers (2023-03-14T11:53:12Z)
- SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation [27.10716804733828]
We propose a novel framework to generate a continuous spherical image representation from a low-resolution (LR) 360° image.
Specifically, we first propose a feature extraction module that represents the spherical data based on an icosahedron.
We then propose a spherical local implicit image function (SLIIF) to predict RGB values at the spherical coordinates.
arXiv Detail & Related papers (2021-12-13T10:16:51Z)
- Extreme Rotation Estimation using Dense Correlation Volumes [73.35119461422153]
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting.
We observe that, even when images do not overlap, there may be rich hidden cues as to their geometric relationship.
We propose a network design that can automatically learn such implicit cues by comparing all pairs of points between the two input images.
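The all-pairs comparison can be written down directly as a 4D correlation volume (a naive O(H²W²) sketch of the idea; real implementations batch this on GPU over learned feature maps):

```python
def correlation_volume(feat_a, feat_b):
    """corr[i][j][k][l] = <feat_a[i][j], feat_b[k][l]>: every location in
    image A scored against every location in image B."""
    hb, wb = len(feat_b), len(feat_b[0])
    return [[[[sum(a * b for a, b in zip(feat_a[i][j], feat_b[k][l]))
               for l in range(wb)]
              for k in range(hb)]
             for j in range(len(feat_a[0]))]
            for i in range(len(feat_a))]

# Two 1x2 feature grids with orthogonal unit features.
fa = [[[1.0, 0.0], [0.0, 1.0]]]
vol = correlation_volume(fa, fa)
print(vol[0][0][0][0], vol[0][0][0][1])  # -> 1.0 0.0
```

A network consuming this volume sees the similarity of every point pair at once, which is what lets it pick up weak non-overlapping cues such as lighting direction or vanishing points.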
arXiv Detail & Related papers (2021-04-28T02:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.