Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
for Spatial Proximity Analysis
- URL: http://arxiv.org/abs/2401.17862v1
- Date: Wed, 31 Jan 2024 14:21:49 GMT
- Title: Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
for Spatial Proximity Analysis
- Authors: Jianing Li, Xi Nan, Ming Lu, Li Du, Shanghang Zhang
- Abstract summary: Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities.
Proximity QA is a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images.
We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis.
- Score: 45.62657605766754
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal large language models (MLLMs) have demonstrated remarkable
vision-language capabilities, primarily due to the exceptional in-context
understanding and multi-task learning strengths of large language models
(LLMs). The advent of visual instruction tuning has further enhanced MLLMs'
performance in vision-language understanding. However, while existing MLLMs
adeptly recognize what objects are in an image, they still face
challenges in effectively discerning where these objects are,
particularly along the distance (scene depth) axis. To overcome this limitation
in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel
framework designed to enable MLLMs to infer the proximity relationship between
objects in images. The framework operates in two phases: the first phase
focuses on guiding the models to understand the relative depth of objects, and
the second phase further encourages the models to infer the proximity
relationships between objects based on their depth perceptions. We also propose
a VQA dataset called Proximity-110K, containing additional instructions that
incorporate depth information and the proximity relationships of objects. We
have conducted extensive experiments to validate Proximity QA's superior
ability in depth perception and proximity analysis, outperforming other
state-of-the-art MLLMs. Code and dataset will be released at
https://github.com/NorthSummer/ProximityQA.git.
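The two phases described in the abstract (relative depth first, then proximity between objects) can be illustrated with a small sketch of how depth-aware question-answer pairs might be built. This is not the authors' pipeline: the function names `region_depth` and `make_qa_pairs`, the question templates, the use of a median over a bounding box, and the convention that a smaller depth value means closer to the camera are all assumptions made for illustration.

```python
from statistics import median


def region_depth(depth_map, box):
    """Median depth inside a box (x0, y0, x1, y1); x1 and y1 are exclusive.

    Assumption: depth_map is a list of rows and smaller values are closer.
    """
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return median(values)


def make_qa_pairs(depth_map, objects):
    """Build hypothetical depth QA and proximity QA pairs.

    objects maps an object name to its bounding box.
    """
    depths = {name: region_depth(depth_map, box) for name, box in objects.items()}

    # Phase 1 (depth perception): one question per object.
    depth_qa = [
        (f"What is the relative depth of the {name}?", f"{d:.2f}")
        for name, d in depths.items()
    ]

    # Phase 2 (proximity analysis): compare every pair of objects by depth.
    names = list(depths)
    proximity_qa = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            closer = a if depths[a] < depths[b] else b
            proximity_qa.append(
                (f"Which is closer to the camera, the {a} or the {b}?",
                 f"The {closer}.")
            )
    return depth_qa, proximity_qa


# Toy 2x4 depth map: left half near (0.2), right half far (0.8).
toy_depth = [[0.2, 0.2, 0.8, 0.8],
             [0.2, 0.2, 0.8, 0.8]]
toy_objects = {"cup": (0, 0, 2, 2), "chair": (2, 0, 4, 2)}
depth_qa, proximity_qa = make_qa_pairs(toy_depth, toy_objects)
```

In this toy case the cup region has the smaller depth value, so the single proximity answer names the cup. A real dataset would presumably take its depth values from a depth estimator or ground-truth annotations rather than a hand-written grid.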
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
- Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models [6.695747085909927]
We introduce P2G, a novel framework for plug-and-play grounding in MLLMs.
P2G employs expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images.
We develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images.
arXiv Detail & Related papers (2024-03-28T11:26:30Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.