Image Translation as Diffusion Visual Programmers
- URL: http://arxiv.org/abs/2401.09742v2
- Date: Tue, 30 Jan 2024 22:49:18 GMT
- Title: Image Translation as Diffusion Visual Programmers
- Authors: Cheng Han, James C. Liang, Qifan Wang, Majid Rabbani, Sohail Dianat,
Raghuveer Rao, Ying Nian Wu, Dongfang Liu
- Abstract summary: Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
- Score: 52.09889190442439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic
image translation framework. Our proposed DVP seamlessly embeds a
condition-flexible diffusion model within the GPT architecture, orchestrating a
coherent sequence of visual programs (i.e., computer vision models) for various
pro-symbolic steps, which span RoI identification, style transfer, and position
manipulation, facilitating transparent and controllable image translation
processes. Extensive experiments demonstrate DVP's remarkable performance,
surpassing concurrent arts. This success can be attributed to several key
features of DVP: First, DVP achieves condition-flexible translation via
instance normalization, enabling the model to eliminate sensitivity caused by
the manual guidance and optimally focus on textual descriptions for
high-quality content generation. Second, the framework enhances in-context
reasoning by deciphering intricate high-dimensional concepts in feature spaces
into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]),
allowing for localized, context-free editing while maintaining overall
coherence. Last but not least, DVP improves systemic controllability and
explainability by offering explicit symbolic representations at each
programming stage, empowering users to intuitively interpret and modify
results. Our research marks a substantial step towards harmonizing artificial
image translation processes with cognitive intelligence, promising broader
applications.
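As a rough illustration of the condition-flexible translation described in the abstract, the sketch below applies plain instance normalization to an intermediate feature map, removing per-sample statistics that manual guidance would otherwise inject so that the text condition drives content. It is a minimal PyTorch approximation with assumed tensor shapes, not the authors' released implementation.

```python
# Minimal sketch (assumed shapes, not the authors' code): instance normalization
# over an intermediate diffusion feature map. Normalizing each channel per sample
# strips guidance-dependent statistics, leaving the text prompt to steer content.
import torch

def instance_normalize(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize a (B, C, H, W) feature map per sample and per channel."""
    mean = feat.mean(dim=(2, 3), keepdim=True)
    var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (feat - mean) / torch.sqrt(var + eps)

# Example with a dummy activation standing in for a U-Net feature map.
feat = torch.randn(2, 320, 64, 64)
out = instance_normalize(feat)
print(out.mean().item(), out.std().item())  # approximately 0 and 1
```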
Related papers
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible means of generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging, particularly with respect to efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [42.10603331311837]
Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z) - Interactive Face Video Coding: A Generative Compression Framework [18.26476468644723]
We propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals.
The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression and headpose animation.
arXiv Detail & Related papers (2023-02-20T11:24:23Z) - KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework; a rough illustrative sketch of such a modality-transition step appears after this list.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
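As a hedged illustration of the Modality Transition Module summarized in the last entry, the sketch below projects visual features into a language embedding space and aligns their pooled representation with a target sentence embedding. The module layout, dimensions, and the cosine-based "modality loss" are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch only: a simple modality-transition module that maps visual
# features into a language embedding space, trained with an assumed cosine-based
# modality loss against a target sentence embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    def __init__(self, visual_dim: int = 2048, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, visual_dim) region/grid features -> (B, N, text_dim)
        return self.proj(visual_feats)

def modality_loss(projected: torch.Tensor, sentence_emb: torch.Tensor) -> torch.Tensor:
    # Align the mean-pooled projected features with the target sentence embedding.
    pooled = projected.mean(dim=1)
    return (1.0 - F.cosine_similarity(pooled, sentence_emb, dim=-1)).mean()

# Dummy usage with assumed dimensions.
mtm = ModalityTransition()
vis = torch.randn(4, 36, 2048)   # e.g. 36 region features per image
target = torch.randn(4, 512)     # e.g. a ground-truth caption embedding
loss = modality_loss(mtm(vis), target)
loss.backward()
```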
This list is automatically generated from the titles and abstracts of the papers on this site.