Context Diffusion: In-Context Aware Image Generation
- URL: http://arxiv.org/abs/2312.03584v1
- Date: Wed, 6 Dec 2023 16:19:51 GMT
- Title: Context Diffusion: In-Context Aware Image Generation
- Authors: Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan,
Vignesh Ramanathan, Filip Radenovic
- Abstract summary: Context Diffusion is a diffusion-based framework that enables image generation models to learn from visual examples presented in context.
Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks.
- Score: 29.281927418777624
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose Context Diffusion, a diffusion-based framework that enables image
generation models to learn from visual examples presented in context. Recent
work tackles such in-context learning for image generation, where a query image
is provided alongside context examples and text prompts. However, the quality
and fidelity of the generated images deteriorate when the prompt is not
present, demonstrating that these models are unable to truly learn from the
visual context. To address this, we propose a novel framework that separates
the encoding of the visual context from the preservation of the query image
structure. As a result, the model can learn from the visual context and text
prompts together, or from either one alone. Furthermore, we enable our model to
handle few-shot settings, to effectively address diverse in-context learning
scenarios. Our experiments and user study demonstrate that Context Diffusion
excels in both in-domain and out-of-domain tasks, resulting in an overall
enhancement in image quality and fidelity compared to counterpart models.
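As a rough illustration of the conditioning split described above, the following is a minimal, hypothetical sketch: visual context examples are encoded into conditioning tokens alongside the text prompt, while the query image is injected as a spatial feature map, so the denoiser can be driven by text, visual context, or both. All module names, layer choices, and dimensions are assumptions made for illustration, not the paper's actual architecture.
```python
import torch
import torch.nn as nn

class TinyContextDenoiser(nn.Module):
    """Toy stand-in for the conditioning split sketched above (not the paper's model)."""
    def __init__(self, latent_ch=4, dim=64, text_dim=64):
        super().__init__()
        # Path 1: few-shot visual context examples -> conditioning tokens (like text tokens).
        self.context_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, text_dim),
        )
        # Path 2: query image -> downsampled spatial map that preserves its structure.
        self.query_encoder = nn.Conv2d(3, latent_ch, kernel_size=8, stride=8)
        # Toy "denoiser": conv trunk with cross-attention over the conditioning tokens.
        self.trunk = nn.Conv2d(latent_ch, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)
        self.out = nn.Conv2d(dim, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latents, text_tokens, context_images, query_image):
        b, _, h, w = noisy_latents.shape
        # Visual context becomes extra tokens next to the text, so either signal can be dropped.
        ctx_tokens = torch.stack([self.context_encoder(img) for img in context_images], dim=1)
        cond = torch.cat([text_tokens, ctx_tokens], dim=1)
        # Query structure enters spatially, independent of the token-based conditioning.
        x = self.trunk(noisy_latents + self.query_encoder(query_image))
        q = x.flatten(2).transpose(1, 2)                      # (b, h*w, dim)
        attended, _ = self.attn(q, cond, cond)
        x = x + attended.transpose(1, 2).reshape(b, -1, h, w)
        return self.out(x)                                    # predicted noise

# Shape check only; real training would follow a standard diffusion objective.
model = TinyContextDenoiser()
latents = torch.randn(2, 4, 8, 8)                             # noisy latents
query = torch.randn(2, 3, 64, 64)                             # query image (structure source)
context = [torch.randn(2, 3, 64, 64) for _ in range(3)]       # 3-shot visual context
text = torch.randn(2, 5, 64)                                  # placeholder text prompt tokens
noise_pred = model(latents, text, context, query)             # -> (2, 4, 8, 8)
```
Keeping the two conditioning paths separate is what allows dropping either the text tokens or the context tokens at inference time without collapsing the other signal.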
Related papers
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z) - Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing
with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z) - In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - Context-driven Visual Object Recognition based on Knowledge Graphs [0.8701566919381223]
We propose an approach that enhances deep learning methods by using external contextual knowledge encoded in a knowledge graph.
We conduct a series of experiments to investigate the impact of different contextual views on the learned object representations for the same image dataset.
arXiv Detail & Related papers (2022-10-20T13:09:00Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen) conditions generation on image-text pairs retrieved from an external multimodal knowledge base, improving fidelity for rare or unseen entities.
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
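Several of the entries above steer diffusion sampling with more than one modality. As a rough, hypothetical illustration of the combined language/image guidance idea in the Semantic Diffusion Guidance entry, the sketch below computes a classifier-guidance-style gradient from similarity to a text embedding and/or a reference-image embedding. `predict_x0` and `embed` are placeholder callables, and the scales are arbitrary; this is not the paper's implementation.
```python
import torch

def guidance_grad(x_t, t, predict_x0, embed, text_emb=None, ref_img_emb=None,
                  text_scale=1.0, image_scale=1.0):
    """Gradient that nudges a reverse-diffusion step toward text and/or image targets.

    The caller would add this gradient (scaled by the step variance) to the
    reverse-step mean, as in classifier guidance. `predict_x0` maps a noisy
    sample to a clean estimate; `embed` maps images to an embedding space
    shared with the guidance embeddings (e.g. a CLIP-like encoder).
    """
    x_t = x_t.detach().requires_grad_(True)
    emb = embed(predict_x0(x_t, t))                 # embed the current clean estimate
    score = torch.zeros((), device=x_t.device)
    if text_emb is not None:                        # language guidance term
        score = score + text_scale * torch.cosine_similarity(emb, text_emb, dim=-1).mean()
    if ref_img_emb is not None:                     # image guidance term
        score = score + image_scale * torch.cosine_similarity(emb, ref_img_emb, dim=-1).mean()
    if not score.requires_grad:                     # neither guidance signal supplied
        return torch.zeros_like(x_t)
    return torch.autograd.grad(score, x_t)[0]
```
Because each term is optional, the same sampler supports language-only, image-only, or joint guidance, which is the flexibility that entry highlights.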
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.