Fugu-MT Paper Translation (Abstract): IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

Paper summary: IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

arxiv url: http://arxiv.org/abs/2312.01771v1
Date: Mon, 4 Dec 2023 09:48:29 GMT
Status: Translation complete
System update date: 2023-12-05 15:24:59.629774
Title: IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks
Title (reference translation): IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks
Authors: Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang
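Abstract summary: This paper proposes IMProv, a generative model that can learn visual tasks in context from multimodal prompts. A masked generative transformer is trained on a new dataset of figures from computer vision papers and their associated captions. At inference time, the model is prompted with text and/or image task examples and inpaints the corresponding output.
Reference score (attention measure computed independently by the site): 124.90137528319273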
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% gains in AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance. Project page is available at https://jerryxu.net/IMProv.
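Abstract (reference translation): In-context learning makes it possible to adapt a model to a new task given a task description at test time. This paper proposes IMProv, a generative model that can learn visual tasks in context from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model learns in context to solve the task for a new test input. A masked generative transformer is trained on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. At inference time, the model is prompted with text and/or image task examples and inpaints the corresponding output. Training with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks: over +10% AP for Foreground Segmentation, over +5% AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. The empirical results suggest that vision and language prompts are complementary and that using both yields better in-context learning performance. The project page is available at https://jerryxu.net/IMProv.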
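
The inpainting-style prompt described in the abstract can be illustrated with a small sketch. It is not the authors' code. It assumes a 2x2 grid layout (example input, example output, query input, blank cell), which the abstract does not specify, and it represents images as plain nested lists. The function name `make_prompt_canvas` and its parameters are hypothetical. No model is called; the sketch only builds the canvas and the mask marking the region a model would be asked to inpaint.

```python
from typing import List, Tuple

# Hypothetical representation: a grayscale image as rows of pixel values.
Image = List[List[int]]


def make_prompt_canvas(example_in: Image, example_out: Image, query_in: Image,
                       fill: int = 0) -> Tuple[Image, Image]:
    """Tile three equally sized images into a 2x2 grid with a blank bottom-right cell.

    Returns (canvas, mask); mask is 1 where a model would be asked to inpaint.
    The 2x2 layout is an assumption made for illustration.
    """
    h, w = len(query_in), len(query_in[0])
    for img in (example_in, example_out, query_in):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all images must share the same height and width")

    # Top row of the grid: example input next to example output.
    top = [a + b for a, b in zip(example_in, example_out)]
    # Bottom row: query input next to a blank cell to be filled in.
    bottom = [row + [fill] * w for row in query_in]
    canvas = top + bottom

    # Mask is zero everywhere except the bottom-right cell.
    mask = [[0] * (2 * w) for _ in range(h)]
    mask += [[0] * w + [1] * w for _ in range(h)]
    return canvas, mask


# Toy 2x2 "images"; the text prompt is the example quoted in the abstract.
example_in = [[1, 2], [3, 4]]
example_out = [[0, 1], [1, 1]]
query_in = [[5, 6], [7, 8]]
task_text = "Left: input image, Right: foreground segmentation"
canvas, mask = make_prompt_canvas(example_in, example_out, query_in)
```

In a real setting the canvas, the mask, and the text prompt would be passed to an inpainting model, and the filled-in bottom-right cell would be read back as the prediction for the query.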

Related papers