Scaffolding Coordinates to Promote Vision-Language Coordination in Large
Multi-Modal Models
- URL: http://arxiv.org/abs/2402.12058v1
- Date: Mon, 19 Feb 2024 11:23:53 GMT
- Title: Scaffolding Coordinates to Promote Vision-Language Coordination in Large
Multi-Modal Models
- Authors: Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li and Yang Liu
- Abstract summary: State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional capabilities in vision-language tasks.
Existing prompting techniques for LMMs focus on either improving textual reasoning or leveraging tools for image preprocessing.
We propose Scaffold prompting that scaffolds coordinates to promote vision-language coordination.
- Score: 18.772045053892885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated
exceptional capabilities in vision-language tasks. Despite their advanced
functionalities, the performance of LMMs is still limited in challenging
scenarios that require complex reasoning with multiple levels of visual
information. Existing prompting techniques for LMMs focus on either improving
textual reasoning or leveraging tools for image preprocessing, lacking a simple
and general visual prompting scheme to promote vision-language coordination in
LMMs. In this work, we propose Scaffold prompting that scaffolds coordinates to
promote vision-language coordination. Specifically, Scaffold overlays a dot
matrix within the image as visual information anchors and leverages
multi-dimensional coordinates as textual positional references. Extensive
experiments on a wide range of challenging vision-language tasks demonstrate
the superiority of Scaffold over GPT-4V with textual CoT prompting. Our
code is released at https://github.com/leixy20/Scaffold.
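As a rough illustration of the mechanism described above, the sketch below overlays an evenly spaced dot matrix on an image and labels each dot with its (row, column) coordinate, so that a textual prompt can refer to the same positions. The grid size, styling, and helper name are illustrative assumptions; the repository linked above contains the authors' actual implementation.

```python
# Minimal sketch of a Scaffold-style coordinate overlay (illustrative only,
# not the authors' released code). Requires Pillow.
from PIL import Image, ImageDraw


def overlay_dot_matrix(image_path: str, rows: int = 6, cols: int = 6) -> Image.Image:
    """Overlay a rows x cols dot matrix and label each dot with its coordinate."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            x = w * c / (cols + 1)  # evenly spaced columns
            y = h * r / (rows + 1)  # evenly spaced rows
            draw.ellipse([x - 3, y - 3, x + 3, y + 3], fill="black", outline="white")
            draw.text((x + 5, y - 5), f"({r},{c})", fill="black")
    return img
```

A prompt can then anchor its reasoning to these labels, e.g. "the cup near (2,3) is to the left of the plate near (2,5)".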
Related papers
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
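The AlignVLM entry above describes mapping visual features to a weighted average of text embeddings. Below is a minimal sketch of that general idea, assuming a softmax over similarities with a text-embedding table; the shapes and the softmax choice are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: express each visual feature as a convex combination
# of text (vocabulary) embeddings. Not the paper's exact formulation.
import numpy as np


def align_visual_to_text(visual_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """visual_feats: (n_patches, d); text_embeds: (vocab_size, d)."""
    logits = visual_feats @ text_embeds.T            # (n_patches, vocab_size)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the vocabulary
    return weights @ text_embeds                     # (n_patches, d), lies in the text-embedding space
```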
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
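The RAG framework above relies on structured scene graphs. As a generic illustration of what such a structure can hold, a scene graph pairs detected objects with relation triples; the field names below are a hypothetical schema, not the paper's.

```python
# Generic scene-graph container (hypothetical schema for illustration).
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[tuple[int, str, int]] = field(default_factory=list)  # (subject_idx, predicate, object_idx)


graph = SceneGraph(
    objects=[SceneObject("cup", (40, 80, 120, 160)), SceneObject("table", (0, 150, 640, 480))],
    relations=[(0, "on", 1)],  # "cup on table"
)
```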
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
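The aggregator above is described as dynamically integrating multi-layer visual features based on textual instructions. A minimal sketch of one way such conditioning could look; the mean pooling and similarity-based weighting are assumptions, not the paper's module.

```python
# Hypothetical instruction-conditioned fusion of per-layer visual features.
import numpy as np


def fuse_layers(layer_feats: np.ndarray, instruction_embed: np.ndarray) -> np.ndarray:
    """layer_feats: (n_layers, n_patches, d); instruction_embed: (d,)."""
    pooled = layer_feats.mean(axis=1)                  # (n_layers, d) summary per layer
    scores = pooled @ instruction_embed                # (n_layers,) relevance to the instruction
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over layers
    return np.tensordot(weights, layer_feats, axes=1)  # (n_patches, d) weighted fusion
```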
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
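Lumen, above, routes a shared representation to lightweight task-specific decoders. A minimal sketch of such routing follows; the decoder registry and task names are hypothetical stand-ins, not Lumen's actual modules.

```python
# Hypothetical routing of a shared representation to lightweight per-task decoders.
from typing import Callable, Dict
import numpy as np

# Toy decoders standing in for lightweight task heads.
DECODERS: Dict[str, Callable[[np.ndarray], np.ndarray]] = {
    "detection": lambda feats: feats[:, :4],            # pretend the first 4 dims are box coordinates
    "segmentation": lambda feats: feats.mean(axis=-1),  # pretend a per-token mask score
}


def route(shared_repr: np.ndarray, task: str) -> np.ndarray:
    """Dispatch the shared representation (n_tokens, d) to the decoder registered for `task`."""
    if task not in DECODERS:
        raise ValueError(f"No decoder registered for task '{task}'")
    return DECODERS[task](shared_repr)
```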
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them with the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that DoCo serves as a plug-and-play pre-training method for various LVLMs, adding no computational overhead at inference.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
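The DoCo entry above aligns document-object features from an auxiliary encoder with the LVLM vision encoder's features. Below is a minimal sketch of a generic InfoNCE-style alignment loss under that reading; the in-batch pairing and temperature are illustrative assumptions, not the paper's exact objective.

```python
# Generic InfoNCE-style contrastive alignment between two feature sets
# (illustrative; not the exact DoCo objective).
import numpy as np


def contrastive_alignment_loss(doc_feats: np.ndarray, vis_feats: np.ndarray, tau: float = 0.07) -> float:
    """doc_feats, vis_feats: (batch, d); row i of each side forms a positive pair."""
    doc = doc_feats / np.linalg.norm(doc_feats, axis=-1, keepdims=True)
    vis = vis_feats / np.linalg.norm(vis_feats, axis=-1, keepdims=True)
    logits = doc @ vis.T / tau                        # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # positives lie on the diagonal
```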
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)