Spatial Preference Rewarding for MLLMs Spatial Understanding
- URL: http://arxiv.org/abs/2510.14374v1
- Date: Thu, 16 Oct 2025 07:16:18 GMT
- Title: Spatial Preference Rewarding for MLLMs Spatial Understanding
- Authors: Han Qiu, Peng Gao, Lewei Lu, Xiaoqin Zhang, Ling Shao, Shijian Lu
- Abstract summary: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
- Score: 92.25703021388142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite these successes, MLLMs still fall short in fine-grained spatial perception, such as generating detailed region descriptions or accurately localizing objects, and they often fail to meet users' requirements for fine-grained spatial understanding. This issue may arise because existing approaches primarily tune MLLMs on pre-annotated instruction data to inject spatial knowledge, without directly supervising the MLLMs' actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs' spatial capabilities by rewarding detailed responses with precise object localization over vague or inaccurate ones. Using randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality of MLLM-generated descriptions. We also refine the MLLM descriptions for better localization accuracy and pair the best-scored refinement with the lowest-scored initial description for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments on standard referring and grounding benchmarks show that SPR effectively improves MLLM spatial understanding with minimal training overhead. Data and code will be released at https://github.com/hanqiu-hq/SPR
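To make the pipeline concrete, here is a minimal sketch of how SPR's preference pairs might be constructed, based only on the abstract: each candidate region description receives a semantic (text-quality) score and a localization score (taken here to be IoU against the sampled region box), and the best-scored refinement is paired with the lowest-scored initial description for direct preference optimization. The `Candidate` structure, `semantic_score_fn`, and the equal score weighting are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str    # region description generated (or refined) by the MLLM
    box: tuple   # (x1, y1, x2, y2) localization parsed from the description

def iou(a, b):
    """Intersection-over-union between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def spr_score(cand, region_box, semantic_score_fn, w_sem=0.5, w_loc=0.5):
    """Combine a text-quality score with a localization score.

    `semantic_score_fn` stands in for whatever text-image similarity
    model rates the description; the equal weighting is an assumption.
    """
    return w_sem * semantic_score_fn(cand.text) + w_loc * iou(cand.box, region_box)

def build_dpo_pair(initial, refined, region_box, semantic_score_fn):
    """Pair the best-scored refinement with the worst-scored initial description."""
    chosen = max(refined, key=lambda c: spr_score(c, region_box, semantic_score_fn))
    rejected = min(initial, key=lambda c: spr_score(c, region_box, semantic_score_fn))
    return {"chosen": chosen.text, "rejected": rejected.text}
```

In the paper's pipeline, the resulting chosen/rejected pairs would feed a standard DPO objective; the scoring models themselves are left abstract here.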
Related papers
- FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
- LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models [9.647551134303384]
Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements. We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning.
arXiv Detail & Related papers (2025-09-29T21:32:54Z)
- Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding [47.400649582392255]
We use multimodal large language models (MLLMs) to explore a zero-shot solution to STVG. We propose an MLLM-based zero-shot framework for STVG, which includes novel temporal-augmented assembling strategies.
arXiv Detail & Related papers (2025-09-18T17:35:50Z)
- Spatio-Temporal LLM: Reasoning about Environments and Actions [6.341762228330488]
"S-temporal" prompts challenge current Multimodal Large Language Models (MLLMs)<n>We show that recent MLLMs indeed struggle to correctly answer "s-temporal" prompts.<n>We build on this dataset to develop two-temporal LLM baselines.
arXiv Detail & Related papers (2025-07-07T17:59:55Z)
- Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization [7.0683335354070085]
We analyze recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios.
arXiv Detail & Related papers (2025-04-14T21:34:06Z)
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
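As a minimal sketch of the mechanics this entry describes, the snippet below fits a weighted Lasso in which lower LLM-assigned penalty factors make a feature cheaper to keep. The column-rescaling trick (equivalent to per-feature L1 weights), the `alpha` value, and the toy data are assumptions for illustration; how the penalty factors are elicited from the LLM is not shown and is not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def llm_weighted_lasso(X, y, penalty_factors, alpha=0.1):
    """Weighted Lasso via column rescaling.

    `penalty_factors` plays the role of the LLM-derived per-feature
    penalties (lower = more relevant, per the entry's description).
    Dividing column j by w_j and fitting a standard Lasso penalizes
    the original coefficient beta_j by alpha * w_j.
    """
    w = np.asarray(penalty_factors, dtype=float)
    X_scaled = X / w                 # scale each column by 1 / w_j
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    return model.coef_ / w           # map coefficients back to the original scale

# Toy usage: feature 0 gets a low (favorable) penalty, feature 2 a high one.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
print(llm_weighted_lasso(X, y, penalty_factors=[0.2, 1.0, 5.0]))
```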
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding [19.510385758079966]
This paper introduces Large Language Model Aided Real-time Scene Recommendation (LARR).
arXiv Detail & Related papers (2024-08-21T10:56:26Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)