Related papers: Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

URL: http://arxiv.org/abs/2412.01027v2
Date: Tue, 03 Dec 2024 03:32:00 GMT
Title: Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Authors: Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao,
Abstract summary: We introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$. We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages. Our method surpasses previous few-shot image manipulation models by a notable margin.
Score: 70.95783968368124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

Related papers

Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score [4.8677910801584385]
Large-scale text-to-image generative models have shown remarkable ability to synthesize diverse and high-quality images. We present Dual Contrastive Denoising Score, a framework that leverages the rich generative prior of text-to-image diffusion models. Our method achieves both flexible content modification and structure preservation between input and output images, as well as zero-shot image-to-image translation.
arXiv Detail & Related papers (2025-08-18T08:30:07Z)
A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis [0.3495246564946556]
We provide an overview of recent contrastive learning approaches in text-image models. We also introduce and discuss the latest advances in the techniques used in the process. We discuss recent state-of-the-art applications of self-supervised contrastive learning in text-image models.
arXiv Detail & Related papers (2025-03-14T05:43:47Z)
Language-Inspired Relation Transfer for Few-shot Class-Incremental Learning [42.923762020491495]
We propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions. Our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of mini-ImageNet and CIFAR-100 FSCIL benchmarks.
arXiv Detail & Related papers (2025-01-10T10:59:27Z)
TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications. We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects. We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework. We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image. We identify the regions relevant to each word by computing the word-conditional visual attention using a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. In this work, we repurpose such models to generate a descriptive text given an image at inference time. The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
Multimodal Few-Shot Learning with Frozen Language Models [36.75551859968596]
We train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples.
arXiv Detail & Related papers (2021-06-25T21:07:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.