General Object Foundation Model for Images and Videos at Scale
- URL: http://arxiv.org/abs/2312.09158v1
- Date: Thu, 14 Dec 2023 17:26:00 GMT
- Title: General Object Foundation Model for Images and Videos at Scale
- Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
- Abstract summary: We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
- Score: 99.2806103051613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present GLEE in this work, an object-level foundation model for locating
and identifying objects in images and videos. Through a unified framework, GLEE
accomplishes detection, segmentation, tracking, grounding, and identification
of arbitrary objects in open-world scenarios across various object perception
tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from
diverse data sources with varying supervision levels to formulate general
object representations, excelling in zero-shot transfer to new data and tasks.
Specifically, we employ an image encoder, text encoder, and visual prompter to
handle multi-modal inputs, enabling GLEE to simultaneously solve various
object-centric downstream tasks while maintaining state-of-the-art performance.
Trained extensively on over five million images from diverse benchmarks, GLEE
exhibits remarkable versatility and improved
generalization performance, efficiently tackling downstream tasks without the
need for task-specific adaptation. By integrating large volumes of
automatically labeled data, we further enhance its zero-shot generalization
capabilities. Additionally, GLEE is capable of being integrated into Large
Language Models, serving as a foundational model to provide universal
object-level information for multi-modal tasks. We hope that the versatility
and universality of our method will mark a significant step in the development
of efficient visual foundation models for AGI systems. The model and code will
be released at https://glee-vision.github.io .
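To make the multi-modal input design described in the abstract concrete (an image encoder, a text encoder, and a visual prompter feeding shared object queries), here is a minimal, hypothetical sketch. Every module name, shape, and head below is an illustrative assumption rather than the released GLEE architecture or API; refer to https://glee-vision.github.io for the actual implementation.

```python
# Minimal, hypothetical sketch of a GLEE-style multi-modal object model.
# All names and shapes here are illustrative assumptions; this does not
# reproduce the released GLEE architecture or API.
import torch
import torch.nn as nn


class SimpleObjectPerceiver(nn.Module):
    def __init__(self, embed_dim=256, num_queries=100, vocab_size=30522):
        super().__init__()
        # Image encoder: stand-in patchify convolution producing visual tokens.
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Text encoder: embeds category names or referring expressions.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Visual prompter: encodes a box/point prompt given as 4 normalized coords.
        self.visual_prompter = nn.Linear(4, embed_dim)
        # Shared object queries and decoder reused by all downstream tasks.
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Task heads: boxes for detection/grounding, embeddings for
        # open-vocabulary identification and tracking association.
        self.box_head = nn.Linear(embed_dim, 4)
        self.embed_head = nn.Linear(embed_dim, embed_dim)

    def forward(self, images, text_tokens=None, visual_prompts=None):
        b = images.shape[0]
        # Visual tokens: (B, HW, C)
        visual = self.image_encoder(images).flatten(2).transpose(1, 2)
        memory = [visual]
        if text_tokens is not None:          # optional language conditioning
            memory.append(self.text_encoder(self.text_embed(text_tokens)))
        if visual_prompts is not None:       # optional visual-prompt conditioning
            memory.append(self.visual_prompter(visual_prompts))
        memory = torch.cat(memory, dim=1)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, memory)
        return {
            "boxes": self.box_head(decoded).sigmoid(),        # normalized xywh
            "object_embeddings": self.embed_head(decoded),    # for matching
        }


model = SimpleObjectPerceiver()
outputs = model(
    images=torch.randn(2, 3, 224, 224),
    text_tokens=torch.randint(0, 30522, (2, 8)),
    visual_prompts=torch.rand(2, 1, 4),
)
print(outputs["boxes"].shape)               # torch.Size([2, 100, 4])
print(outputs["object_embeddings"].shape)   # torch.Size([2, 100, 256])
```

In this sketch, the per-object embeddings are what a downstream head could match against text embeddings for open-vocabulary identification or against previous frames for tracking association, which is how a single query-based interface can serve multiple object-centric tasks.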
Related papers
- PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects [104.34288029037141]
We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images.
PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in open-world scenarios.
arXiv Detail & Related papers (2024-07-23T17:58:26Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks using only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is a unifying multi-task framework that integrates pixel-level, region-level, and global features to refine the perceptual capabilities of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.