Large Language Models are Visual Reasoning Coordinators
- URL: http://arxiv.org/abs/2310.15166v1
- Date: Mon, 23 Oct 2023 17:59:31 GMT
- Title: Large Language Models are Visual Reasoning Coordinators
- Authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt
Keutzer, Trevor Darrell, Ziwei Liu
- Abstract summary: We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
- Score: 144.67558375045755
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual reasoning requires multimodal perception and commonsense cognition of
the world. Recently, multiple vision-language models (VLMs) have been proposed
with excellent commonsense reasoning ability in various domains. However, how
to harness the collective power of these complementary VLMs is rarely explored.
Existing methods such as ensembling still struggle to aggregate these models with
the desired higher-order communication. In this work, we propose Cola, a novel
paradigm that coordinates multiple VLMs for visual reasoning. Our key insight
is that a large language model (LLM) can efficiently coordinate multiple VLMs
by facilitating natural language communication that leverages their distinct
and complementary capabilities. Extensive experiments demonstrate that our
instruction tuning variant, Cola-FT, achieves state-of-the-art performance on
visual question answering (VQA), outside knowledge VQA, visual entailment, and
visual spatial reasoning tasks. Moreover, we show that our in-context learning
variant, Cola-Zero, exhibits competitive performance in zero and few-shot
settings, without finetuning. Through systematic ablation studies and
visualizations, we validate that a coordinator LLM indeed comprehends the
instruction prompts as well as the separate functionalities of VLMs; it then
coordinates them to enable impressive visual reasoning capabilities.
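A minimal sketch of the coordination paradigm described in the abstract: each VLM answers the visual question independently, and the coordinator LLM reads their (possibly conflicting) candidate answers, together with a caption, as plain natural language and produces the final answer. The `vlms` callables and `llm_complete` function below are hypothetical placeholders, not the authors' actual interfaces; this is an illustration of the idea rather than the paper's implementation.

```python
from typing import Callable

def coordinate(question: str,
               caption: str,
               vlms: dict[str, Callable[[str], str]],
               llm_complete: Callable[[str], str]) -> str:
    """Ask each VLM the visual question, then let the coordinator LLM read
    the candidate answers in natural language and decide on a final answer."""
    candidates = {name: ask(question) for name, ask in vlms.items()}
    prompt = (
        f"Image caption: {caption}\n"
        f"Question: {question}\n"
        + "".join(f"{name} answer: {ans}\n" for name, ans in candidates.items())
        + "Considering the caption and the candidate answers, the answer is:"
    )
    return llm_complete(prompt).strip()

# Hypothetical usage: plug in real VQA models and an LLM of choice, e.g.
# coordinate("What is the man holding?", "a man on a beach with a surfboard",
#            {"vlm_a": vlm_a_vqa, "vlm_b": vlm_b_vqa}, llm_generate)
```

In the in-context variant (Cola-Zero) the coordinator is only prompted in this way, while the instruction tuning variant (Cola-FT) finetunes the coordinator LLM on such prompts.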
Related papers
- Enhancing Advanced Visual Reasoning Ability of Large Language Models [20.32900494896848]
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning.
We propose Complex Visual Reasoning Large Language Models (CVR-LLM).
Our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop.
We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning.
arXiv Detail & Related papers (2024-09-21T02:10:19Z)
- Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators [0.0]
We show how a large language model (LLM) can efficiently coordinate multiple vision-language models (VLMs) through natural language communication.
Our study investigates whether the same methodology can be applied to surveillance videos for action recognition.
arXiv Detail & Related papers (2024-07-20T10:26:28Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations, whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to endow RelationVLM with the capability to understand semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences (a sketch of one way to mine such pairs appears after this list).
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships within the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
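The "Eyes Wide Shut?" entry above mentions mining CLIP-blind pairs. Below is a minimal sketch of one plausible way such pairs could be found, assuming the Hugging Face transformers API, DINOv2 as the contrasting vision backbone, and illustrative similarity thresholds; the paper's exact mining procedure may differ.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dinov2-small").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-small")

@torch.no_grad()
def embed(images: list[Image.Image]) -> tuple[torch.Tensor, torch.Tensor]:
    """Return L2-normalized CLIP and DINOv2 embeddings for a batch of images."""
    c = clip.get_image_features(**clip_proc(images=images, return_tensors="pt"))
    d = dino(**dino_proc(images=images, return_tensors="pt")).pooler_output
    return F.normalize(c, dim=-1), F.normalize(d, dim=-1)

def clip_blind_pairs(images, clip_thresh=0.95, dino_thresh=0.6):
    """Index pairs (i, j) whose CLIP similarity is high (CLIP sees them as
    near-identical) while DINOv2 similarity is low (they look different).
    Thresholds are illustrative assumptions, not values from the paper."""
    c, d = embed(images)
    clip_sim, dino_sim = c @ c.T, d @ d.T
    n = len(images)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh]
```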
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.