Fugu-MT Paper Translation (Summary): The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Paper summary: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

arxiv url: http://arxiv.org/abs/2309.17421v2
Date: Wed, 11 Oct 2023 05:07:37 GMT
Status: Translation complete
Last updated in system: 2023-10-13 02:49:32.136986
Title: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Title (reference translation): The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Authors: Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
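Abstract summary: We analyze the latest model, GPT-4V, to deepen the understanding of LMMs. GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system. GPT-4V's unique capability of understanding visual markers drawn on input images gives rise to new human-computer interaction methods.
Reference score (independently calculated attention level): 121.42924593374127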
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf
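Abstract (reference translation): Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, and includes test samples that probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and effective ways to prompt the model. To explore GPT-4V, we collect and organize carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples show that GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs, together with the genericity of its capabilities, makes GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. The report concludes with in-depth discussions of emerging application scenarios and future research directions for GPT-4V-based systems. We hope this preliminary exploration will inspire future research on next-generation multimodal task formulation, on new ways to exploit and enhance LMMs to solve real-world problems, and on a better understanding of multimodal foundation models. Finally, we acknowledge that the model under study is solely the product of OpenAI's innovative work, and that they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf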
