The All-Seeing Project: Towards Panoptic Visual Recognition and
Understanding of the Open World
- URL: http://arxiv.org/abs/2308.01907v1
- Date: Thu, 3 Aug 2023 17:59:47 GMT
- Authors: Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie
Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng
Dai, Yu Qiao
- Abstract summary: We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the All-Seeing (AS) project: a large-scale data and model for
recognizing and understanding everything in the open world. Using a scalable
data engine that incorporates human feedback and efficient models in the loop,
we create a new dataset (AS-1B) with over 1 billion regions annotated with
semantic tags, question-answering pairs, and detailed captions. It covers a
wide range of 3.5 million common and rare concepts in the real world, and has
132.2 billion tokens that describe the concepts and their attributes.
Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified
framework for panoptic visual recognition and understanding. The model is
trained with open-ended language prompts and locations, which allows it to
generalize to various vision and language tasks with remarkable zero-shot
performance, including region-text retrieval, region recognition, captioning,
and question-answering. We hope that this project can serve as a foundation for
vision-language artificial general intelligence research. Models and the
dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo
can be viewed at https://huggingface.co/spaces/OpenGVLab/all-seeing.
Related papers
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews Vision-Language Geo-Foundation Models (VLGFMs), summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- Composition Vision-Language Understanding via Segment and Depth Anything Model [2.0836143651641033]
This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V.
Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models.
Our findings showcase progress in vision-language models through neural-symbolic integration.
arXiv Detail & Related papers (2024-06-07T16:28:06Z)
- Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models [42.48862540545121]
We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models.
EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans.
We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains.
arXiv Detail & Related papers (2024-05-15T17:19:42Z)
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models [32.01009756533755]
Current Large Vision Language Models (LVLMs) are predominantly constrained to locating a single, pre-existing object.
We introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs.
Griffon achieves state-of-the-art performance on the fine-grained RefCOCO series.
It also approaches the capabilities of the expert model Faster R-CNN on the detection benchmark MSCOCO.
arXiv Detail & Related papers (2023-11-24T15:35:07Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
The "vokenization" model is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- VisualSem: A High-quality Knowledge Graph for Vision and Language [48.47370435793127]
We release VisualSem, a high-quality knowledge graph (KG).
VisualSem includes nodes with multilingual glosses, multiple illustrative images, and visually relevant relations.
We also release a neural multi-modal retrieval model that can use images or sentences as inputs and retrieves entities in the KG.
arXiv Detail & Related papers (2020-08-20T18:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.