Fugu-MT Paper Translation (Abstract): InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Paper overview: InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

arxiv url: http://arxiv.org/abs/2310.00390v1
Date: Sat, 30 Sep 2023 14:26:43 GMT
Status: Translation complete
Last updated in system: 2023-10-05 04:40:52.741983
Title: InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
Authors: Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa
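Abstract summary: We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices. Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
Reference score (independently computed attention score): 70.83664336391922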
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
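The dataset construction step described in the abstract (pairing each input image with a task instruction and a visually encoded output image) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `TEMPLATES` dictionary, the `Example` record, and the `build_example` helper are all hypothetical names, and the hand-written template variants stand in for the paraphrases that the paper obtains from a large language model.

```python
import random
from dataclasses import dataclass

# Hypothetical prompt templates, one list per task. In the paper the variants
# are produced by an LLM paraphrasing step; these are hand-written stand-ins.
TEMPLATES = {
    "segmentation": [
        "Segment the {category}.",
        "Please mark every pixel that belongs to the {category}.",
    ],
    "detection": [
        "Detect the {category}.",
        "Draw a bounding box around each {category}.",
    ],
    "depth": [
        "Estimate the depth map of this image.",
        "Show how far each pixel is from the camera.",
    ],
    "classification": [
        "Is there a {category} in this image?",
        "Tell me whether the picture contains a {category}.",
    ],
}


@dataclass(frozen=True)
class Example:
    """One training triplet: input image, instruction, target image."""
    input_image: str   # path to the source image
    instruction: str   # natural language task description
    output_image: str  # path to the visually encoded task output


def build_example(task, input_image, output_image, category=None, rng=random):
    """Pick a random template for the task and fill in the category."""
    template = rng.choice(TEMPLATES[task])
    if "{category}" in template and category is None:
        raise ValueError(f"task {task!r} needs a category")
    instruction = template.format(category=category)
    return Example(input_image, instruction, output_image)


if __name__ == "__main__":
    # File names below are placeholders, not real dataset paths.
    print(build_example("segmentation", "in.jpg", "mask.png", category="dog"))
    print(build_example("depth", "in.jpg", "depth.png"))
```

Sampling a different template for each example is one simple way to expose the model to varied phrasings of the same task, which is the role the abstract assigns to the paraphrased prompt templates.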
