Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation
Models
- URL: http://arxiv.org/abs/2303.04671v1
- Date: Wed, 8 Mar 2023 15:50:02 GMT
- Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation
Models
- Authors: Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang,
Nan Duan
- Abstract summary: ChatGPT is attracting a cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains.
However, since ChatGPT is trained on language alone, it cannot process or generate images from the visual world.
Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of different Visual Foundation Models.
- Score: 55.11367495777145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ChatGPT is attracting a cross-field interest as it provides a language
interface with remarkable conversational competency and reasoning capabilities
across many domains. However, since ChatGPT is trained on language alone, it
currently cannot process or generate images from the visual world.
At the same time, Visual Foundation Models, such as Visual Transformers or
Stable Diffusion, show great visual understanding and generation capabilities,
but they are only experts on specific tasks with one-round fixed
inputs and outputs. To this end, we build a system called \textbf{Visual
ChatGPT}, incorporating different Visual Foundation Models, that enables the
user to interact with ChatGPT by 1) sending and receiving not only language but
also images, 2) providing complex visual questions or visual editing
instructions that require the collaboration of multiple AI models over multiple
steps, and 3) providing feedback and asking for corrected results. We design
a series of prompts to inject the visual model information into ChatGPT,
considering models of multiple inputs/outputs and models that require visual
feedback. Experiments show that Visual ChatGPT opens the door to investigating
the visual roles of ChatGPT with the help of Visual Foundation Models. Our
system is publicly available at
\url{https://github.com/microsoft/visual-chatgpt}.
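
The prompt-injection mechanism described above (tool descriptions exposed to a text-only ChatGPT, images passed around as file references, multi-step tool calls) can be sketched in a few lines of Python. The snippet below is a minimal illustration of that control flow under those assumptions, not the released implementation; VisualTool, build_prompt, dispatch, and the stub model are all hypothetical names.

```python
# Minimal sketch (not the authors' code) of injecting visual-tool information
# into a text-only chat model and dispatching its replies to visual tools.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VisualTool:
    name: str         # e.g. "Image Captioning" or "Stable Diffusion Inpainting"
    description: str  # injected into the prompt so the chat model knows when to pick it
    run: Callable[[str, str], str]  # (image_path, instruction) -> result text or image path


def build_prompt(tools: Dict[str, VisualTool], history: str, user_input: str) -> str:
    """Inject the visual model information (names + descriptions) into the prompt."""
    tool_list = "\n".join(f"- {t.name}: {t.description}" for t in tools.values())
    return (
        "You may call a visual tool by replying exactly as "
        "'TOOL: <name> | INPUT: <instruction>', or answer directly.\n"
        f"Available tools:\n{tool_list}\n\n"
        f"Conversation so far:\n{history}\nUser: {user_input}\n"
    )


def dispatch(reply: str, tools: Dict[str, VisualTool], image_path: str) -> str:
    """Parse the model's reply and run the requested visual tool, if any."""
    if reply.startswith("TOOL:"):
        name, _, instruction = reply[len("TOOL:"):].partition("| INPUT:")
        tool = tools.get(name.strip())
        if tool is not None:
            return tool.run(image_path, instruction.strip())
    return reply  # plain-language answer, no visual tool needed


if __name__ == "__main__":
    # Stub tool and stub chat model, just to show the control flow end to end.
    tools = {
        "Image Captioning": VisualTool(
            name="Image Captioning",
            description="describe the content of an image",
            run=lambda img, _instr: f"a caption for {img}",
        )
    }

    def fake_chat_model(prompt: str) -> str:
        return "TOOL: Image Captioning | INPUT: describe the photo"

    prompt = build_prompt(tools, history="", user_input="What is in photo.png?")
    print(dispatch(fake_chat_model(prompt), tools, image_path="photo.png"))
```

The full system additionally manages chat history, filenames for intermediate images, and the iterative feedback loop mentioned in the abstract, which this sketch omits.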
Related papers
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input-token constraint of Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
arXiv Detail & Related papers (2024-03-14T12:21:37Z)
- MIVC: Multiple Instance Visual Component for Visual-Language Models [46.869139462026]
We propose MIVC, a general multiple instance visual component that bridges the gap between various image inputs and off-the-shelf vision-language models.
We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification, and captioning tasks.
arXiv Detail & Related papers (2023-12-28T16:33:32Z)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a unified vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z)
- Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning [0.0]
We subject Google Bard and GPT-Vision to 64 visual tasks spanning categories like "Visual Situational Reasoning" and "Next Scene Prediction".
Our findings spotlight both vision-language models' limitations.
arXiv Detail & Related papers (2023-08-17T03:14:00Z)
- LayoutGPT: Compositional Visual Planning and Generation with Large Language Models [98.81962282674151]
Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions.
We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language.
arXiv Detail & Related papers (2023-05-24T17:56:16Z)
- InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language [82.92236977726655]
InternGPT stands for interaction, nonverbal, and chatbots.
We present an interactive visual framework named InternGPT, or iGPT for short.
arXiv Detail & Related papers (2023-05-09T17:58:34Z)
- ChatLLM Network: More brains, More intelligence [42.65167827451101]
We propose ChatLLM network that allows multiple dialogue-based language models to interact, provide feedback, and think together.
We show that our network attains significant improvements in problem-solving, with observable progress for each member.
arXiv Detail & Related papers (2023-04-24T08:29:14Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.