VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
- URL: http://arxiv.org/abs/2403.09027v1
- Date: Thu, 14 Mar 2024 01:39:40 GMT
- Title: VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
- Authors: Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou
- Abstract summary: We introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models.
VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features.
This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision.
- Score: 47.58359136198136
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available. Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, and Foundation model
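The "LLM as pivot" pattern described in the abstract (the LLM decomposes a user request into action proposals, dispatches them to vision foundation models, and fuses the results into one response) can be illustrated with a short sketch. The code below is a minimal, hypothetical illustration of that orchestration loop, not the released VisionGPT implementation; the MODEL_REGISTRY, the fake_llm stub, and the JSON proposal schema are all assumptions introduced here for clarity.

```python
# Minimal sketch of the "LLM as pivot" pattern: the LLM breaks a user request into
# action proposals, suitable foundation models are called, and their outputs are
# integrated into one response. All names here (MODEL_REGISTRY, fake_llm, the
# proposal schema) are illustrative assumptions, not the VisionGPT API.
import json
from typing import Callable, Dict, List

# Hypothetical registry of vision foundation models, keyed by action name.
MODEL_REGISTRY: Dict[str, Callable[[dict], dict]] = {
    "detect_objects": lambda args: {"model": "detector", "boxes": [[10, 20, 50, 80]]},
    "caption_image": lambda args: {"model": "captioner", "caption": "a dog on grass"},
}

def fake_llm(prompt: str) -> str:
    """Stand-in for the pivot LLM (e.g., LLaMA-2). A real system would prompt the
    LLM to emit a JSON plan or a summary; here we return canned text."""
    if prompt.startswith("PLAN:"):
        return json.dumps([
            {"action": "detect_objects", "args": {"image": "photo.png"}},
            {"action": "caption_image", "args": {"image": "photo.png"}},
        ])
    return "Found one object and produced a caption for photo.png."

def handle_request(user_request: str, image_path: str) -> str:
    # 1) LLM-as-pivot: decompose the request into detailed action proposals.
    proposals: List[dict] = json.loads(fake_llm(f"PLAN: {user_request} ({image_path})"))
    # 2) Dispatch each proposal to the matching foundation model.
    outputs = [MODEL_REGISTRY[p["action"]](p.get("args", {}))
               for p in proposals if p.get("action") in MODEL_REGISTRY]
    # 3) Integrate the multi-source outputs into a comprehensive user-facing response.
    return fake_llm("SUMMARIZE: " + json.dumps(outputs))

if __name__ == "__main__":
    print(handle_request("What is in this picture?", "photo.png"))
```

Replacing fake_llm with a real LLM client and the registry lambdas with actual detector/captioner wrappers would reproduce the three-step flow the abstract names: decompose, dispatch, integrate.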
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception [24.406224705072763]
Mutually Reinforced Multimodal Large Language Model (MR-MLLM) is a novel framework that enhances visual perception and multimodal comprehension.
First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models.
Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs.
arXiv Detail & Related papers (2024-06-22T07:10:36Z) - VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding [47.58359136198136]
VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models.
It seamlessly integrates various SOTA vision models and automates the selection of the appropriate model for each task.
It identifies suitable 3D mesh creation algorithms corresponding to 2D depth map analysis and generates optimal results from diverse multimodal inputs.
arXiv Detail & Related papers (2024-03-14T16:13:00Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to 32 generic visual-linguistic benchmarks, achieving state-of-the-art performance.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z) - Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds [37.22688246779871]
Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world.
LLMs tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game".
We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
arXiv Detail & Related papers (2023-10-20T03:22:05Z) - UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)