The All-Seeing Project: Towards Panoptic Visual Recognition and
Understanding of the Open World
- URL: http://arxiv.org/abs/2308.01907v1
- Date: Thu, 3 Aug 2023 17:59:47 GMT
- Authors: Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie
Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng
Dai, Yu Qiao
- Abstract summary: We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the All-Seeing (AS) project: a large-scale data and model for
recognizing and understanding everything in the open world. Using a scalable
data engine that incorporates human feedback and efficient models in the loop,
we create a new dataset (AS-1B) with over 1 billion regions annotated with
semantic tags, question-answering pairs, and detailed captions. It covers a
wide range of 3.5 million common and rare concepts in the real world, and has
132.2 billion tokens that describe the concepts and their attributes.
Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified
framework for panoptic visual recognition and understanding. The model is
trained with open-ended language prompts and locations, which allows it to
generalize to various vision and language tasks with remarkable zero-shot
performance, including region-text retrieval, region recognition, captioning,
and question-answering. We hope that this project can serve as a foundation for
vision-language artificial general intelligence research. Models and the
dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo
can be viewed at https://huggingface.co/spaces/OpenGVLab/all-seeing.
Related papers
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews Vision-Language Geo-Foundation Models (VLGFMs), summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- Composition Vision-Language Understanding via Segment and Depth Anything Model [2.0836143651641033]
This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V.
Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models.
Our findings showcase progress in vision-language models through neural-symbolic integration.
arXiv Detail & Related papers (2024-06-07T16:28:06Z)
- Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models [42.48862540545121]
We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models.
EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans.
We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains.
arXiv Detail & Related papers (2024-05-15T17:19:42Z)
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models [32.01009756533755]
Current Large Vision Language Models (LVLMs) are predominantly constrained to locating a single, pre-existing object.
We introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs.
Griffon achieves state-of-the-art performance on the fine-grained RefCOCO series.
It also approaches the capabilities of the expert model Faster R-CNN on the detection benchmark MSCOCO.
arXiv Detail & Related papers (2023-11-24T15:35:07Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
The "vokenization" model is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- VisualSem: A High-quality Knowledge Graph for Vision and Language [48.47370435793127]
We release VisualSem, a high-quality knowledge graph (KG).
VisualSem includes nodes with multilingual glosses, multiple illustrative images, and visually relevant relations.
We also release a neural multi-modal retrieval model that can use images or sentences as inputs and retrieves entities in the KG.
arXiv Detail & Related papers (2020-08-20T18:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.