Osprey: Pixel Understanding with Visual Instruction Tuning
- URL: http://arxiv.org/abs/2312.10032v3
- Date: Thu, 14 Mar 2024 15:50:17 GMT
- Title: Osprey: Pixel Understanding with Visual Instruction Tuning
- Authors: Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
- Abstract summary: Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions.
To achieve this goal, we first curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representations into the LLM.
Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input.
- Score: 15.094943732551018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short of fine-grained vision-language alignment at the pixel level. Moreover, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming to achieve pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representations into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be seamlessly integrated with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
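The architecture sketched in the abstract (a convolutional CLIP encoder plus a mask-aware visual extractor whose pooled region features are injected into the LLM) can be illustrated with a minimal mask-pooling example. This is only a sketch under assumed shapes and module names, not Osprey's released implementation; the projection layer, dimensions, and tensor layouts are illustrative assumptions.

```python
# Minimal sketch (not Osprey's actual code): pool CLIP feature maps under a
# binary region mask to obtain one region token per mask, projected into the
# LLM embedding space. Shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAwareExtractor(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # project pooled visual features into the LLM embedding space
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, feat_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        """
        feat_map: (B, C, H, W) feature map from a (convolutional) CLIP encoder
        masks:    (B, N, h, w) binary masks, one per referred region
        returns:  (B, N, llm_dim) one embedding per mask region
        """
        B, C, H, W = feat_map.shape
        # resize masks to the feature-map resolution
        masks = F.interpolate(masks.float(), size=(H, W), mode="nearest")
        # mask pooling: average the features that fall inside each mask
        area = masks.sum(dim=(-2, -1)).clamp(min=1.0)                       # (B, N)
        pooled = torch.einsum("bchw,bnhw->bnc", feat_map, masks) / area.unsqueeze(-1)
        return self.proj(pooled)                                            # (B, N, llm_dim)

# usage with dummy tensors
extractor = MaskAwareExtractor()
feats = torch.randn(1, 1024, 64, 64)                   # high-resolution CLIP features
region_masks = torch.randint(0, 2, (1, 3, 512, 512))   # e.g. masks produced by SAM
region_tokens = extractor(feats, region_masks)         # (1, 3, 4096)
```

In practice, each pooled region token would be interleaved with the instruction's text embeddings (e.g., at a region placeholder in the prompt) so the LLM can reason about the specified mask region; the masks themselves can come from any source, including SAM as noted above.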
Related papers
- GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [22.729750410621826]
GeoPix is a remote sensing (RS) MLLM that extends image understanding capabilities to the pixel level.
To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor.
To address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset.
arXiv Detail & Related papers (2025-01-12T14:45:27Z) - Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding [0.0]
We propose a mask-text instruction tuning method called Aquila-plus to achieve pixel-level visual understanding.
Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features.
Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks.
arXiv Detail & Related papers (2024-11-09T10:42:57Z) - Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z) - TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM).
We propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics and generate condensed visual tokens.
Our approach compresses the visual tokens by 75% to 89%, while achieving comparable or even better performance across diverse benchmarks.
arXiv Detail & Related papers (2024-07-02T16:10:55Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, a new methodology that explicitly notates each entity using token collectives, i.e., groups of visual tokens.
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation [104.5033800500497]
Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence.
Previous works learn to directly align the sentence embedding with pixel-level embeddings to highlight the referred objects.
We propose CoupAlign, a simple yet effective multi-level visual-semantic alignment method; a minimal sketch of the word-pixel alignment idea appears after this entry.
arXiv Detail & Related papers (2022-12-04T08:53:42Z)
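As referenced in the CoupAlign entry above, the following is a hypothetical sketch of the basic word-pixel alignment idea common to referring image segmentation methods; it does not reproduce CoupAlign's actual multi-level (sentence-mask) design, and all module names and dimensions are assumptions.

```python
# Hypothetical sketch of word-pixel alignment for referring segmentation
# (illustrative only; CoupAlign additionally couples sentence-mask alignment).
import torch
import torch.nn as nn

class WordPixelAlign(nn.Module):
    def __init__(self, text_dim: int = 512, vis_dim: int = 256, embed_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.vis_proj = nn.Conv2d(vis_dim, embed_dim, kernel_size=1)

    def forward(self, word_feats: torch.Tensor, pix_feats: torch.Tensor) -> torch.Tensor:
        """
        word_feats: (B, L, text_dim) word embeddings of the referring sentence
        pix_feats:  (B, vis_dim, H, W) pixel-level visual features
        returns:    (B, H, W) mask logits for the referred object
        """
        w = self.text_proj(word_feats)                # (B, L, D)
        p = self.vis_proj(pix_feats)                  # (B, D, H, W)
        # similarity of every word with every pixel, then aggregate over words
        sim = torch.einsum("bld,bdhw->blhw", w, p)    # (B, L, H, W)
        return sim.max(dim=1).values                  # (B, H, W) logits

# usage with dummy tensors
align = WordPixelAlign()
words = torch.randn(2, 12, 512)           # 12 word tokens per sentence
pixels = torch.randn(2, 256, 120, 120)
logits = align(words, pixels)             # apply sigmoid/threshold to get a mask
```

Applying a sigmoid and threshold to the returned logits yields a binary mask for the referred object; CoupAlign's contribution is to couple this word-pixel signal with a sentence-level mask alignment, which is not modeled here.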
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.