Scaffolding Coordinates to Promote Vision-Language Coordination in Large
Multi-Modal Models
- URL: http://arxiv.org/abs/2402.12058v1
- Date: Mon, 19 Feb 2024 11:23:53 GMT
- Title: Scaffolding Coordinates to Promote Vision-Language Coordination in Large
Multi-Modal Models
- Authors: Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li and Yang Liu
- Abstract summary: State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional capabilities in vision-language tasks.
Existing prompting techniques for LMMs focus on either improving textual reasoning or leveraging tools for image preprocessing.
We propose Scaffold prompting that scaffolds coordinates to promote vision-language coordination.
- Score: 18.772045053892885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated
exceptional capabilities in vision-language tasks. Despite their advanced
functionalities, the performances of LMMs are still limited in challenging
scenarios that require complex reasoning with multiple levels of visual
information. Existing prompting techniques for LMMs focus on either improving
textual reasoning or leveraging tools for image preprocessing, lacking a simple
and general visual prompting scheme to promote vision-language coordination in
LMMs. In this work, we propose Scaffold prompting that scaffolds coordinates to
promote vision-language coordination. Specifically, Scaffold overlays a dot
matrix within the image as visual information anchors and leverages
multi-dimensional coordinates as textual positional references. Extensive
experiments on a wide range of challenging vision-language tasks demonstrate
the superiority of Scaffold over GPT-4V with textual CoT prompting. Our
code is released at https://github.com/leixy20/Scaffold.
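The visual-prompting step described in the abstract, overlaying a coordinate-labeled dot matrix on the input image as anchors for textual positional references, can be sketched as follows. This is a minimal illustration using PIL with hypothetical grid and styling parameters; it is not the authors' released implementation (see the repository linked above).

```python
# Minimal sketch of a Scaffold-style visual prompt: overlay a dot matrix on an
# image and label each dot with its (row, col) coordinate so the model can
# refer to regions by coordinates. Grid size and styling are assumptions.
from PIL import Image, ImageDraw

def overlay_dot_matrix(image: Image.Image, rows: int = 6, cols: int = 6) -> Image.Image:
    img = image.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    r = max(2, min(w, h) // 200)  # dot radius scaled to image size
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Place dots at the centers of an evenly spaced grid.
            x = w * (j - 0.5) / cols
            y = h * (i - 0.5) / rows
            draw.ellipse([x - r, y - r, x + r, y + r], fill="black", outline="white")
            # Label each dot with its coordinate as a textual anchor.
            draw.text((x + r + 2, y - r), f"({i},{j})", fill="black")
    return img

# Usage: the overlaid image is passed to the LMM together with a prompt that
# explains the coordinate convention, e.g. "Dots are labeled (row, col) from
# the top-left corner."
# prompted = overlay_dot_matrix(Image.open("example.jpg"))
```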
Related papers
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models [27.45225442048711]
We introduce CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks.
We also present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm.
arXiv Detail & Related papers (2024-10-21T16:30:29Z)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a large multimodal model with versatile vision-centric capability enhancements.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs); a generic sketch of this kind of contrastive alignment is given after this list.
We demonstrate that DoCo serves as a plug-and-play pre-training method that can be employed in the pre-training of various LVLMs without increasing computational cost at inference.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
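The DoCo entry above describes contrastively aligning document-object features with the features produced by an LVLM's vision encoder. Below is a minimal, generic InfoNCE-style sketch of such an alignment loss; the tensor shapes, pooling, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Generic symmetric InfoNCE-style contrastive alignment between two pooled
# feature sets, in the spirit of the DoCo objective described above.
# All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obj_feats: torch.Tensor,
                               vis_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """obj_feats, vis_feats: (batch, dim) pooled features from an auxiliary
    multimodal encoder and the LVLM vision encoder, respectively."""
    obj = F.normalize(obj_feats, dim=-1)
    vis = F.normalize(vis_feats, dim=-1)
    logits = obj @ vis.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(obj.size(0), device=obj.device)
    # Symmetric cross-entropy: each object feature should match its own image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```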