Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- URL: http://arxiv.org/abs/2404.07449v1
- Date: Thu, 11 Apr 2024 03:09:34 GMT
- Title: Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- Authors: Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
- Abstract summary: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks.
However, existing V-LLMs demonstrate weak spatial reasoning and localization awareness.
We explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs.
- Score: 38.02017186215372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
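The abstract does not spell out the exact prompt or coordinate format; below is a minimal sketch of what coordinate-based instruction fine-tuning data could look like, assuming pixel boxes are normalized into [x1, y1, x2, y2] text (the function names, prompt template, and normalization scheme are illustrative assumptions, not the paper's implementation).

```python
# Sketch: turning a detection-style annotation into a coordinate-based
# instruction-tuning sample for a V-LLM. The prompt template and the
# normalized [x1, y1, x2, y2] text format are illustrative assumptions.

def box_to_text(box, image_w, image_h, precision=2):
    """Render a pixel-space box as normalized [x1, y1, x2, y2] text."""
    x1, y1, x2, y2 = box
    norm = (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

def make_localization_sample(image_id, category, box, image_w, image_h):
    """Build one (instruction, answer) pair asking the model to localize an object."""
    return {
        "image": image_id,
        "instruction": f"Provide the bounding box of the {category} in the image.",
        "answer": box_to_text(box, image_w, image_h),
    }

# Example: a single annotation becomes pseudo instruction data,
# e.g. answer "[0.25, 0.25, 0.75, 0.75]" for a centered object.
sample = make_localization_sample("000001.jpg", "dog", (160, 120, 480, 360), 640, 480)
print(sample["instruction"])
print(sample["answer"])
```

Rendering coordinates as plain text keeps such samples compatible with a standard language-modeling loss; whether the paper uses this exact representation is not stated in the abstract.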
Related papers
- Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework incorporates structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - FiVL: A Framework for Improved Vision-Language Alignment [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding.
These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence.
To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines, alongside a validation method and an application to explainability.
arXiv Detail & Related papers (2024-12-19T09:24:10Z) - Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS [68.47681139026666]
Video object segmentation (VOS) is a crucial task in computer vision.
Current VOS methods struggle with complex scenes and prolonged object motions.
This report introduces a discriminative spatial-temporal VOS model.
arXiv Detail & Related papers (2024-08-29T10:47:17Z) - MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, coined MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art approaches by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z) - VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability of these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning [22.93684323791136]
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.
We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data.
Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based tasks through ICCC instruction tuning.
arXiv Detail & Related papers (2024-04-01T04:28:01Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions.
Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.