InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
- URL: http://arxiv.org/abs/2305.05662v4
- Date: Fri, 2 Jun 2023 16:19:48 GMT
- Title: InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
- Authors: Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa
Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu,
Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo,
Jifeng Dai, Yu Qiao
- Abstract summary: InternGPT stands for \textbf{inter}action, \textbf{n}onverbal, and \textbf{chat}bots.
We present an interactive visual framework named InternGPT, or iGPT for short.
- Score: 82.92236977726655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an interactive visual framework named InternGPT, or iGPT for
short. The framework integrates chatbots that have planning and reasoning
capabilities, such as ChatGPT, with non-verbal instructions like pointing
movements that enable users to directly manipulate images or videos on the
screen. Pointing (including gestures, cursors, etc.) movements can provide more
flexibility and precision in performing vision-centric tasks that require
fine-grained control, editing, and generation of visual content. The name
InternGPT stands for \textbf{inter}action, \textbf{n}onverbal, and
\textbf{chat}bots. Different from existing interactive systems that rely on
pure language, by incorporating pointing instructions, the proposed iGPT
significantly improves the efficiency of communication between users and
chatbots, as well as the accuracy of chatbots in vision-centric tasks,
especially in complicated visual scenarios where the number of objects is
greater than 2. Additionally, in iGPT, an auxiliary control mechanism is used
to improve the control capability of LLM, and a large vision-language model
termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing
ChatGPT-3.5-turbo with 93.89% GPT-4 Quality). We hope this work can spark new
ideas and directions for future interactive visual systems. The code is
available at https://github.com/OpenGVLab/InternGPT.
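The abstract describes the pointing-plus-language mechanism only at a high level. Below is a minimal, hypothetical sketch of that dispatch pattern, assuming a point-prompted segmenter and a region-conditioned editing tool; none of the names (Pointing, segment_at_point, edit_region, handle_request) are taken from the InternGPT codebase.

```python
# Hypothetical sketch of the interaction pattern in the abstract: a language
# instruction is paired with a nonverbal pointing instruction (a click or
# gesture), so the target object is resolved from coordinates instead of
# being described in words alone. All names are illustrative, not InternGPT's API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pointing:
    """A nonverbal instruction: the pixel the user clicked or gestured at."""
    x: int
    y: int


def segment_at_point(image, p: Pointing) -> dict:
    # Stub for a point-prompted segmenter (e.g. a SAM-style model).
    return {"mask": f"object under ({p.x}, {p.y})"}


def edit_region(image, mask: dict, instruction: str) -> str:
    # Stub for a region-conditioned editing/generation tool.
    return f"applied '{instruction}' to {mask['mask']}"


def handle_request(instruction: str, image, pointing: Optional[Pointing] = None) -> str:
    # Without a pointing instruction, a crowded scene forces the user to
    # single out the target in words; with one, the tool chain resolves the
    # target directly from the coordinates -- the efficiency gain the paper
    # claims for scenes with more than two objects.
    if pointing is None:
        return f"chatbot answers '{instruction}' from language alone"
    mask = segment_at_point(image, pointing)
    return edit_region(image, mask, instruction)


print(handle_request("remove this cup", image=None, pointing=Pointing(320, 140)))
```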
Related papers
- Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning [0.0]
We subject Google Bard and GPT-Vision to 64 visual tasks spanning categories like "Visual Situational Reasoning" and "Next Scene Prediction".
Our findings spotlight both vision-language models' limitations.
arXiv Detail & Related papers (2023-08-17T03:14:00Z)
- AmadeusGPT: a natural language interface for interactive animal behavioral analysis [65.55906175884748]
We introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code.
We show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks.
AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system.
arXiv Detail & Related papers (2023-07-10T19:15:17Z)
- GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System [8.660929270060146]
This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs).
The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech.
arXiv Detail & Related papers (2023-05-10T10:14:16Z)
- ChatLLM Network: More brains, More intelligence [42.65167827451101]
We propose ChatLLM network that allows multiple dialogue-based language models to interact, provide feedback, and think together.
We show that our network attains significant improvements in problem-solving, leading to observable progress among its members.
arXiv Detail & Related papers (2023-04-24T08:29:14Z)
- Towards Making the Most of ChatGPT for Machine Translation [75.576405098545]
ChatGPT shows remarkable capabilities for machine translation (MT).
Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages.
arXiv Detail & Related papers (2023-03-24T03:35:21Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- FaceChat: An Emotion-Aware Face-to-face Dialogue Framework [58.67608580694849]
FaceChat is a web-based dialogue framework that enables emotionally-sensitive and face-to-face conversations.
The system has a wide range of potential applications, including counseling, emotional support, and personalized customer service.
arXiv Detail & Related papers (2023-03-08T20:45:37Z)
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [55.11367495777145]
ChatGPT is attracting a cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains.
However, since ChatGPT is trained only on language, it cannot process or generate images from the visual world.
Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of different Visual Foundation Models.
arXiv Detail & Related papers (2023-03-08T15:50:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.