Florence: A New Foundation Model for Computer Vision
- URL: http://arxiv.org/abs/2111.11432v1
- Date: Mon, 22 Nov 2021 18:59:55 GMT
- Title: Florence: A New Foundation Model for Computer Vision
- Authors: Lu Yuan and Dongdong Chen and Yi-Ling Chen and Noel Codella and Xiyang
Dai and Jianfeng Gao and Houdong Hu and Xuedong Huang and Boxin Li and
Chunyuan Li and Ce Liu and Mengchen Liu and Zicheng Liu and Yumao Lu and Yu
Shi and Lijuan Wang and Jianfeng Wang and Bin Xiao and Zhen Xiao and Jianwei
Yang and Michael Zeng and Luowei Zhou and Pengchuan Zhang
- Abstract summary: We introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object).
By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks.
Florence achieves new state-of-the-art results in the majority of the 44 representative benchmarks.
- Score: 97.26333007250142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated visual understanding of our diverse and open world demands computer
vision models that generalize well with minimal customization for specific tasks,
similar to human vision. Computer vision foundation models, which are trained
on diverse, large-scale datasets and can be adapted to a wide range of
downstream tasks, are critical for this mission to solve real-world computer
vision applications. While existing vision foundation models such as CLIP,
ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual
representations to a cross-modal shared representation, we introduce a new
computer vision foundation model, Florence, to expand the representations from
coarse (scene) to fine (object), from static (images) to dynamic (videos), and
from RGB to multiple modalities (caption, depth). By incorporating universal
visual-language representations from Web-scale image-text data, our Florence
model can be easily adapted for various computer vision tasks, such as
classification, retrieval, object detection, VQA, image captioning, video
retrieval and action recognition. Moreover, Florence demonstrates outstanding
performance in many types of transfer learning: fully sampled fine-tuning,
linear probing, few-shot transfer and zero-shot transfer for novel images and
objects. All of these properties are critical for our vision foundation model
to serve general purpose vision tasks. Florence achieves new state-of-the-art
results in majority of 44 representative benchmarks, e.g., ImageNet-1K
zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of
97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
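To make the zero-shot transfer setting concrete, the sketch below shows how a dual-encoder image-text model of this kind is typically applied to zero-shot classification: class names are turned into text prompts, and the image is assigned to the class whose prompt embedding is most similar. This is a minimal illustration, not Florence's released interface; `image_encoder`, `text_encoder`, and the prompt template are placeholders assumed to produce embeddings of the same dimension.

```python
# Minimal zero-shot classification with a dual-encoder image-text model
# (CLIP/Florence-style). NOT Florence's actual API: the encoders and the
# prompt template are placeholder assumptions.
import torch
import torch.nn.functional as F


def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # One natural-language prompt per candidate class.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (K, D)
    img_emb = F.normalize(image_encoder(image), dim=-1)     # (1, D)
    # Cosine similarity between the image and every class prompt.
    probs = (img_emb @ text_emb.T).softmax(dim=-1)          # (1, K)
    return class_names[probs.argmax(dim=-1).item()], probs


# Usage with any compatible encoder pair (hypothetical):
# label, probs = zero_shot_classify(img_tensor, ["cat", "dog", "car"],
#                                   image_encoder, text_encoder)
```

The same pair of encoders supports the other transfer modes mentioned above: linear probing freezes the image encoder and trains only a linear head, while fine-tuning updates the encoder weights on the downstream dataset.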
Related papers
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment to collect a vast amount of real human feedback on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images.
We design a GPT-participated conversion to process this feedback into 200K instruction-response pairs in diverse formats.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks [94.49801814314435]
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
We co-developed FLD-5B, a dataset consisting of 5.4 billion comprehensive visual annotations on 126 million images.
We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks.
arXiv Detail & Related papers (2023-11-10T18:59:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver performance on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- MaxViT: Multi-Axis Vision Transformer [19.192826213493838]
We introduce an efficient and scalable attention model we call multi-axis attention.
We present a new architectural element by effectively blending our proposed attention model with convolutions.
We demonstrate the effectiveness of our model on a broad spectrum of vision tasks; a simplified sketch of the multi-axis attention pattern follows after this list.
arXiv Detail & Related papers (2022-04-04T17:59:44Z)
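As noted in the MaxViT entry above, the sketch below illustrates, under simplifying assumptions and not as the authors' implementation, the two attention patterns that multi-axis attention alternates between: local "block" attention inside non-overlapping PxP windows, and sparse global "grid" attention over tokens sampled at a fixed stride. The MBConv blocks, relative position bias, and residual/MLP layers of the full architecture are omitted; window size and head count are arbitrary demo choices.

```python
# Simplified block + grid attention (the two halves of multi-axis attention).
# Not the authors' code; see the caveats in the paragraph above.
import torch
import torch.nn as nn


def partition(x, p, grid=False):
    """Group a (B, H, W, C) map into sets of P*P tokens.

    grid=False: each group is a contiguous PxP window (block attention).
    grid=True : each group is a PxP grid of tokens spaced H/P, W/P apart
                (grid attention).
    """
    B, H, W, C = x.shape
    if grid:
        x = x.view(B, p, H // p, p, W // p, C).permute(0, 2, 4, 1, 3, 5)
    else:
        x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)


def unpartition(tokens, p, B, H, W, grid=False):
    """Inverse of `partition`, back to a (B, H, W, C) map."""
    C = tokens.shape[-1]
    tokens = tokens.reshape(B, H // p, W // p, p, p, C)
    if grid:
        tokens = tokens.permute(0, 3, 1, 4, 2, 5)
    else:
        tokens = tokens.permute(0, 1, 3, 2, 4, 5)
    return tokens.reshape(B, H, W, C)


class MultiAxisAttention(nn.Module):
    """Block attention followed by grid attention on a spatial feature map."""

    def __init__(self, dim, num_heads=4, p=4):
        super().__init__()
        self.p = p
        self.block_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, _ = x.shape
        for attn, grid in ((self.block_attn, False), (self.grid_attn, True)):
            tokens = partition(x, self.p, grid)
            tokens, _ = attn(tokens, tokens, tokens)
            x = unpartition(tokens, self.p, B, H, W, grid)
        return x


feat = torch.randn(2, 16, 16, 64)               # toy (B, H, W, C) feature map
print(MultiAxisAttention(dim=64)(feat).shape)   # torch.Size([2, 16, 16, 64])
```

Alternating the two patterns gives each token both a local and a sparse global receptive field while keeping the attention cost linear in the number of tokens.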