Improving In-Context Learning in Diffusion Models with Visual
Context-Modulated Prompts
- URL: http://arxiv.org/abs/2312.01408v1
- Date: Sun, 3 Dec 2023 14:15:52 GMT
- Title: Improving In-Context Learning in Diffusion Models with Visual
Context-Modulated Prompts
- Authors: Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You,
Hongxia Yang, Mingyuan Zhou
- Abstract summary: We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
- Score: 83.03471704115786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In light of the remarkable success of in-context learning in large language
models, its potential extension to the vision domain, particularly with visual
foundation models like Stable Diffusion, has sparked considerable interest.
Existing approaches in visual in-context learning frequently face hurdles such
as expensive pretraining, limiting frameworks, inadequate visual comprehension,
and limited adaptability to new tasks. In response to these challenges, we
introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff
integrates an end-to-end trained vision encoder that converts visual context
into an embedding vector. This vector is subsequently used to modulate the
token embeddings of text prompts. We show that a diffusion-based vision
foundation model, when equipped with this visual context-modulated text
guidance and a standard ControlNet structure, exhibits versatility and
robustness across a variety of training tasks and excels in in-context learning
for novel vision tasks, such as normal-to-image or image-to-line
transformations. The effectiveness of these capabilities relies heavily on a
deep visual understanding, which is achieved through relevant visual
demonstrations processed by our proposed in-context learning architecture.
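As a rough illustration of the mechanism the abstract describes (a minimal sketch only; the module names, shapes, and the additive modulation operator below are assumptions, not details taken from the paper), the visual-context embedding produced by an end-to-end trained vision encoder could modulate the text token embeddings before they are passed, together with a ControlNet-conditioned backbone, to the diffusion model:

```python
import torch
import torch.nn as nn

class VisualContextEncoder(nn.Module):
    """Hypothetical vision encoder: maps an in-context (source, target) example
    pair to a single context embedding vector (architecture is an assumption)."""
    def __init__(self, in_channels: int = 6, embed_dim: int = 768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, example_pair: torch.Tensor) -> torch.Tensor:
        # example_pair: (B, 6, H, W) -- source and target demo images stacked on channels
        feats = self.backbone(example_pair).flatten(1)   # (B, 128)
        return self.proj(feats)                          # (B, embed_dim)


def modulate_text_tokens(text_tokens: torch.Tensor,
                         context_vec: torch.Tensor) -> torch.Tensor:
    """Inject the visual-context vector into the text token embeddings.
    Simple additive modulation is used here; the paper's exact operator may differ."""
    return text_tokens + context_vec.unsqueeze(1)        # broadcast over the token axis


# Sketch of one conditioning step (the diffusion UNet / ControlNet calls are stubs):
B, T, D = 2, 77, 768
text_tokens = torch.randn(B, T, D)          # CLIP-style token embeddings of the prompt
example_pair = torch.randn(B, 6, 256, 256)  # in-context (source, target) demonstration

encoder = VisualContextEncoder(embed_dim=D)
context_vec = encoder(example_pair)
conditioned_tokens = modulate_text_tokens(text_tokens, context_vec)
# `conditioned_tokens` would then serve as the cross-attention context for the
# diffusion UNet, while a ControlNet branch consumes the query image.
```

Additive modulation is only one plausible injection point; since the encoder is trained end-to-end with the rest of the model, any differentiable way of folding the context vector into the text-conditioning path would play the same role in this sketch.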
Related papers
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant [48.220285886328746]
We introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant.
SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge.
Fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods.
arXiv Detail & Related papers (2024-03-17T18:42:38Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)