Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- URL: http://arxiv.org/abs/2311.06242v1
- Date: Fri, 10 Nov 2023 18:59:08 GMT
- Title: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- Authors: Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
- Abstract summary: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
We co-developed FLD-5B, a dataset consisting of 5.4 billion comprehensive visual annotations on 126 million images.
We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks.
- Score: 94.49801814314435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Florence-2, a novel vision foundation model with a unified,
prompt-based representation for a variety of computer vision and
vision-language tasks. While existing large vision models excel in transfer
learning, they struggle to perform a diverse range of tasks from simple
instructions, a capability that requires handling varying spatial hierarchies
and levels of semantic granularity. Florence-2 was designed to take text
prompts as task instructions and generate the desired results in text form,
whether for captioning, object detection, grounding, or segmentation. This
multi-task learning setup demands large-scale, high-quality annotated data. To
this end, we co-developed FLD-5B, a dataset consisting of 5.4 billion
comprehensive visual annotations on 126 million images, using an iterative
strategy of
automated image annotation and model refinement. We adopted a
sequence-to-sequence structure to train Florence-2 to perform versatile and
comprehensive vision tasks. Extensive evaluations on numerous tasks
demonstrated Florence-2 to be a strong vision foundation model contender with
unprecedented zero-shot and fine-tuning capabilities.
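As a concrete illustration of the prompt-based, sequence-to-sequence interface described above, the sketch below runs one task by switching a text prompt. It assumes the publicly released Hugging Face checkpoint microsoft/Florence-2-large, the task-prompt tokens documented on its model card (e.g. <CAPTION>, <OD>, <CAPTION_TO_PHRASE_GROUNDING>), and a placeholder image URL; these details are not specified in the abstract itself.

```python
# Minimal usage sketch of Florence-2's prompt-as-task interface.
# Assumes the Hugging Face checkpoint "microsoft/Florence-2-large" and the
# task-prompt tokens from its model card; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(
    requests.get("https://example.com/street_scene.jpg", stream=True).raw
)

# The same weights serve different tasks; only the prompt changes,
# e.g. "<CAPTION>" for captioning or "<OD>" for object detection.
task_prompt = "<OD>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Parse the generated text back into structured output for the chosen task,
# e.g. {"<OD>": {"bboxes": [...], "labels": [...]}} for object detection.
result = processor.post_process_generation(
    raw_text, task=task_prompt, image_size=(image.width, image.height)
)
print(result)
```

Because every task is expressed as text in, text out, grounding or segmentation is invoked the same way, by swapping the prompt and re-parsing the output, rather than by attaching task-specific heads.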
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
We require only text-only instructions and image-caption data during training to learn task-solving and visual perception abilities separately.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z)
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion [83.62294567506076]
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2.
We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs.
Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks.
arXiv Detail & Related papers (2024-12-05T18:50:39Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [44.879418596312554]
We present MIMIC-IT, a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos.
Otter, a model trained on the MIMIC-IT dataset, demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning.
We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- Florence: A New Foundation Model for Computer Vision [97.26333007250142]
We introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object).
By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks.
Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks.
arXiv Detail & Related papers (2021-11-22T18:59:55Z)