Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
for Spatial Proximity Analysis
- URL: http://arxiv.org/abs/2401.17862v1
- Date: Wed, 31 Jan 2024 14:21:49 GMT
- Title: Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
for Spatial Proximity Analysis
- Authors: Jianing Li, Xi Nan, Ming Lu, Li Du, Shanghang Zhang
- Abstract summary: Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities.
Proximity QA is a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images.
We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis.
- Score: 45.62657605766754
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal large language models (MLLMs) have demonstrated remarkable
vision-language capabilities, primarily due to the exceptional in-context
understanding and multi-task learning strengths of large language models
(LLMs). The advent of visual instruction tuning has further enhanced MLLMs'
performance in vision-language understanding. However, while existing MLLMs
adeptly recognize "what" objects are in an image, they still face
challenges in effectively discerning "where" these objects are,
particularly along the distance (scene depth) axis. To overcome this limitation
in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel
framework designed to enable MLLMs to infer the proximity relationship between
objects in images. The framework operates in two phases: the first phase
focuses on guiding the models to understand the relative depth of objects, and
the second phase further encourages the models to infer the proximity
relationships between objects based on their depth perceptions. We also propose
a VQA dataset called Proximity-110K, containing additional instructions that
incorporate depth information and the proximity relationships of objects. We
have conducted extensive experiments to validate Proximity QA's superior
ability in depth perception and proximity analysis, outperforming other
state-of-the-art MLLMs. Code and dataset will be released at
https://github.com/NorthSummer/ProximityQA.git.
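As a rough illustration of the two-phase idea described in the abstract (first teach the model relative depth of individual objects, then have it infer proximity relations from those depths), the sketch below generates toy depth and proximity question-answer pairs from per-object depth estimates. All object names, depth values, and question templates here are assumptions for illustration only; this is not the actual Proximity-110K construction pipeline described in the paper.

```python
# Minimal sketch, assuming per-object depth estimates are available (e.g. from
# a monocular depth estimator). Phase 1 produces single-object depth
# instructions; Phase 2 produces pairwise proximity instructions. Hypothetical
# templates and values, not the paper's data pipeline.

from dataclasses import dataclass


@dataclass
class DetectedObject:
    name: str
    depth: float  # estimated mean distance from the camera, in meters (assumed)


def depth_question(obj: DetectedObject) -> dict:
    """Phase 1: instruction about the relative depth of a single object."""
    return {
        "question": f"How far is the {obj.name} from the camera?",
        "answer": f"The {obj.name} is roughly {obj.depth:.1f} meters away.",
    }


def proximity_question(a: DetectedObject, b: DetectedObject) -> dict:
    """Phase 2: instruction comparing two objects along the depth axis."""
    closer, farther = (a, b) if a.depth < b.depth else (b, a)
    return {
        "question": f"Which is closer to the camera, the {a.name} or the {b.name}?",
        "answer": f"The {closer.name} is closer to the camera than the {farther.name}.",
    }


if __name__ == "__main__":
    chair = DetectedObject("chair", 1.8)
    window = DetectedObject("window", 4.2)
    print(depth_question(chair))
    print(proximity_question(chair, window))
```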
Related papers
- MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image [16.040813949620958]
We introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis.
Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism.
This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks.
arXiv Detail & Related papers (2024-11-25T09:00:36Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models [6.695747085909927]
We introduce P2G, a novel framework for plug-and-play grounding in MLLMs.
P2G employs expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images.
We develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images.
arXiv Detail & Related papers (2024-03-28T11:26:30Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.