Related papers: Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

URL: http://arxiv.org/abs/2401.17862v1
Date: Wed, 31 Jan 2024 14:21:49 GMT
Title: Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Authors: Jianing Li, Xi Nan, Ming Lu, Li Du, Shanghang Zhang
Abstract summary: Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities. Proximity QA is a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images. We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis.
Score: 45.62657605766754
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities, primarily due to the exceptional in-context understanding and multi-task learning strengths of large language models (LLMs). The advent of visual instruction tuning has further enhanced MLLMs' performance in vision-language understanding. However, while existing MLLMs adeptly recognize \textit{what} objects are in an image, they still face challenges in effectively discerning \textit{where} these objects are, particularly along the distance (scene depth) axis. To overcome this limitation in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images. The framework operates in two phases: the first phase focuses on guiding the models to understand the relative depth of objects, and the second phase further encourages the models to infer the proximity relationships between objects based on their depth perceptions. We also propose a VQA dataset called Proximity-110K, containing additional instructions that incorporate depth information and the proximity relationships of objects. We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis, outperforming other state-of-the-art MLLMs. Code and dataset will be released at \textcolor{magenta}{https://github.com/NorthSummer/ProximityQA.git}.

Related papers

CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography [12.305953690308085]
Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. Recent advancements, including the reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, have opened this capability. We focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics interplay with the camera parameters.
arXiv Detail & Related papers (2025-04-14T10:53:44Z)
EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing [3.3072144045024396]
EagleVision is an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. We construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks.
arXiv Detail & Related papers (2025-03-30T06:13:13Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models [6.695747085909927]
We introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G employs expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images. We develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images.
arXiv Detail & Related papers (2024-03-28T11:26:30Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. We propose a dual-Level vIsual knedgeOwl eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.