Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes
- URL: http://arxiv.org/abs/2412.11396v1
- Date: Mon, 16 Dec 2024 02:52:19 GMT
- Title: Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes
- Authors: Antonio Carlos Rivera, Anthony Moore, Steven Robinson
- Abstract summary: Vision-Aware Retrieval-Augmented Prompting (VRAP) is a generative approach that enhances Large Vision-Language Models.
VRAP achieves state-of-the-art performance in fine-grained reasoning and multimodal understanding.
- Abstract: Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP's ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
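The tag-augmented prompting pipeline described in the abstract can be illustrated with a minimal sketch. The function and tag schema below are hypothetical (the paper does not publish an API); tag extraction by the visual encoder and scene graph parser, as well as knowledge enrichment, are assumed to happen offline, and only the prompt-assembly step is shown:

```python
# Hypothetical sketch of VRAP-style prompt construction: structured tags
# (objects, attributes, relationships) extracted offline are serialized
# into a text prefix for the LVLM's input. All names here are illustrative.

def build_vrap_prompt(question, tags):
    """Serialize retrieval-augmented tags into a prompt prefix."""
    obj_line = "Objects: " + ", ".join(tags["objects"])
    attr_line = "Attributes: " + "; ".join(
        f"{obj} is {attr}" for obj, attr in tags["attributes"]
    )
    rel_line = "Relations: " + "; ".join(
        f"{s} {pred} {o}" for s, pred, o in tags["relations"]
    )
    return "\n".join([obj_line, attr_line, rel_line, f"Question: {question}"])

# Example tag set for an image of a dog catching a frisbee
tags = {
    "objects": ["dog", "frisbee"],
    "attributes": [("dog", "brown")],
    "relations": [("dog", "catching", "frisbee")],
}
print(build_vrap_prompt("What is the dog doing?", tags))
```

Because the tags are computed ahead of time and simply concatenated into the prompt, no retrieval call is needed at inference, which is consistent with the latency reduction the abstract reports.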
Related papers
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests [69.00444996464662]
We present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios.
Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings.
We investigate training strategies that leverage relevant entities to improve visual reasoning.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge [24.538839144639653]
Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components.
These models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM).
arXiv Detail & Related papers (2024-11-25T18:33:14Z) - Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
arXiv Detail & Related papers (2024-06-24T17:58:03Z) - Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models [69.79709804046325]
We introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination.
R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension.
We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.
arXiv Detail & Related papers (2024-06-24T08:42:42Z) - Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models [44.60439935450292]
We propose RECODE, a novel method for zero-shot visual relation detection.
It decomposes each predicate category into subject, object, and spatial components.
Different visual cues enhance the discriminability of similar relation categories from different perspectives.
arXiv Detail & Related papers (2023-05-21T14:40:48Z) - Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
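Contrastive objectives appear twice above: in VRAP's ablations and in DoCo's feature-alignment pre-training. The common building block in such methods is an InfoNCE-style loss, where each feature is pulled toward its matching counterpart and pushed away from the other items in the batch. The sketch below is a plain-Python illustration of that generic idea, not a reproduction of either paper's actual loss:

```python
# Generic InfoNCE-style contrastive loss: aux[i] should match vis[i];
# all other vis vectors in the batch act as negatives. Illustrative only.
import math

def cosine(u, v):
    """Cosine similarity between two vectors (given as lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(aux, vis, temperature=0.07):
    """Average cross-entropy of picking the true partner for each aux[i]."""
    loss = 0.0
    for i, a in enumerate(aux):
        logits = [cosine(a, v) / temperature for v in vis]
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)  # -log softmax at the true index
    return loss / len(aux)

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = info_nce(feats, feats)                    # matched pairs: low loss
shuffled = info_nce(feats, list(reversed(feats)))   # mismatched: higher loss
```

When the two feature sets are perfectly aligned, the diagonal similarities dominate and the loss approaches zero; permuting one side raises it, which is the signal such pre-training methods optimize.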
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.