IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks
- URL: http://arxiv.org/abs/2312.01771v1
- Date: Mon, 4 Dec 2023 09:48:29 GMT
- Title: IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks
- Authors: Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao,
Trevor Darrell, Xiaolong Wang
- Abstract summary: In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
- Score: 124.90137528319273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning allows adapting a model to new tasks given a task
description at test time. In this paper, we present IMProv - a generative model
that is able to in-context learn visual tasks from multimodal prompts. Given a
textual description of a visual task (e.g. "Left: input image, Right:
foreground segmentation"), a few input-output visual examples, or both, the
model in-context learns to solve it for a new test input. We train a masked
generative transformer on a new dataset of figures from computer vision papers
and their associated captions, together with a captioned large-scale image-text
dataset. During inference time, we prompt the model with text and/or image task
example(s) and have the model inpaint the corresponding output. We show that
training our model with text conditioning and scaling the dataset size improves
in-context learning for computer vision tasks by over +10% AP for Foreground
Segmentation, over +5% gains in AP for Single Object Detection, and almost
20% lower LPIPS in Colorization. Our empirical results suggest that vision and
language prompts are complementary and it is advantageous to use both to
achieve better in-context learning performance. Project page is available at
https://jerryxu.net/IMProv .
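As a rough illustration of the prompting setup described in the abstract, the sketch below lays out one example input-output pair and a query input on a 2x2 canvas, masks the cell where the query's output should appear, and hands the canvas to an inpainting model, optionally conditioned on a text prompt. The cell size, the model loader, and the `inpaint` call are hypothetical placeholders, not the released IMProv interface.

```python
# Sketch of inpainting-based visual prompting: paste an example input-output
# pair and a query input onto one canvas, mask the cell where the query's
# output belongs, and let the model inpaint it (optionally text-conditioned).
# `load_improv_model` and `model.inpaint` are hypothetical stand-ins.
import numpy as np
from PIL import Image

CELL = 224  # side length of each grid cell (an assumption, not the paper's value)

def make_prompt_canvas(example_in, example_out, query_in):
    """Lay out a 2x2 grid: [example input | example output]
                           [query input   | masked output ]."""
    canvas = np.zeros((2 * CELL, 2 * CELL, 3), dtype=np.uint8)
    mask = np.zeros((2 * CELL, 2 * CELL), dtype=np.uint8)
    cells = [((0, 0), example_in), ((0, 1), example_out), ((1, 0), query_in)]
    for (r, c), img in cells:
        patch = np.array(img.convert("RGB").resize((CELL, CELL)))
        canvas[r * CELL:(r + 1) * CELL, c * CELL:(c + 1) * CELL] = patch
    mask[CELL:, CELL:] = 255  # bottom-right cell is the region to inpaint
    return Image.fromarray(canvas), Image.fromarray(mask)

# Hypothetical usage:
# model = load_improv_model()                                  # stand-in loader
# canvas, mask = make_prompt_canvas(ex_in, ex_out, query)
# result = model.inpaint(canvas, mask,
#                        text="Left: input image, Right: foreground segmentation")
# prediction = result.crop((CELL, CELL, 2 * CELL, 2 * CELL))   # inpainted cell
```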
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- MOFI: Learning Image Representations from Noisy Entity Annotated Images [47.6984817573981]
We present MOFI, a new vision foundation model designed to learn image representations from noisy entity annotated images.
We introduce a new approach to automatically assign entity labels to images from noisy image-text pairs.
Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image.
arXiv Detail & Related papers (2023-06-13T17:51:18Z)
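A minimal sketch of the labeling pipeline summarized in the MOFI entry above: run named entity recognition on the alt-text, then let CLIP keep only the entities that actually match the image. The spaCy model, the CLIP checkpoint, and the similarity threshold are illustrative assumptions, not the paper's exact settings.

```python
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")  # NER model (illustrative choice)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_labels(image: Image.Image, alt_text: str, threshold: float = 0.25):
    """Return entities from the alt-text that CLIP judges to match the image."""
    entities = [ent.text for ent in nlp(alt_text).ents]
    if not entities:
        return []
    inputs = processor(text=entities, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # Cosine similarity between the image embedding and each entity embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)
    return [e for e, s in zip(entities, sims.tolist()) if s > threshold]
```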
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- Visual Prompting via Image Inpainting [104.98602202198668]
Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples.
We apply visual prompting to pretrained models and demonstrate results on various downstream image-to-image tasks.
arXiv Detail & Related papers (2022-09-01T17:59:33Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
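For reference, a minimal sketch of the dual-encoder contrastive objective described in the entry above: image and text embeddings for a batch of pairs are normalized and trained with a symmetric InfoNCE loss so that matched pairs align. The encoders are left abstract and the temperature value is an assumption, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image InfoNCE over a batch of pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs:
# loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```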