VCoder: Versatile Vision Encoders for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2312.14233v1
- Date: Thu, 21 Dec 2023 18:49:47 GMT
- Title: VCoder: Versatile Vision Encoders for Multimodal Large Language Models
- Authors: Jitesh Jain, Jianwei Yang, Humphrey Shi
- Abstract summary: Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks.
However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail.
We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
- Score: 46.95488342139727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans possess the remarkable skill of Visual Perception, the ability to see
and understand the seen, helping them make sense of the visual world and, in
turn, reason. Multimodal Large Language Models (MLLM) have recently achieved
impressive performance on vision-language tasks ranging from visual
question-answering and image captioning to visual reasoning and image
generation. However, when prompted to identify or count (perceive) the entities
in a given image, existing MLLM systems fail. Working towards developing an
accurate MLLM system for perception and reasoning, we propose using Versatile
vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the
VCoder with perception modalities such as segmentation or depth maps, improving
the MLLM's perception abilities. Secondly, we leverage the images from COCO and
outputs from off-the-shelf vision perception models to create our COCO
Segmentation Text (COST) dataset for training and evaluating MLLMs on the
object perception task. Thirdly, we introduce metrics to assess the object
perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive
experimental evidence proving the VCoder's improved object-level perception
skills over existing Multimodal LLMs, including GPT-4V. We open-source our
dataset, code, and models to promote research. We open-source our code at
https://github.com/SHI-Labs/VCoder
Related papers
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.