Image Translation as Diffusion Visual Programmers
- URL: http://arxiv.org/abs/2401.09742v2
- Date: Tue, 30 Jan 2024 22:49:18 GMT
- Title: Image Translation as Diffusion Visual Programmers
- Authors: Cheng Han, James C. Liang, Qifan Wang, Majid Rabbani, Sohail Dianat,
Raghuveer Rao, Ying Nian Wu, Dongfang Liu
- Abstract summary: Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
- Score: 52.09889190442439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic
image translation framework. Our proposed DVP seamlessly embeds a
condition-flexible diffusion model within the GPT architecture, orchestrating a
coherent sequence of visual programs (i.e., computer vision models) for various
pro-symbolic steps, which span RoI identification, style transfer, and position
manipulation, facilitating transparent and controllable image translation
processes. Extensive experiments demonstrate DVP's remarkable performance,
surpassing concurrent arts. This success can be attributed to several key
features of DVP: First, DVP achieves condition-flexible translation via
instance normalization, enabling the model to eliminate sensitivity caused by
the manual guidance and optimally focus on textual descriptions for
high-quality content generation. Second, the framework enhances in-context
reasoning by deciphering intricate high-dimensional concepts in feature spaces
into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]),
allowing for localized, context-free editing while maintaining overall
coherence. Last but not least, DVP improves systemic controllability and
explainability by offering explicit symbolic representations at each
programming stage, empowering users to intuitively interpret and modify
results. Our research marks a substantial step towards harmonizing artificial
image translation processes with cognitive intelligence, promising broader
applications.
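As a rough illustration of the condition-flexible translation described in the abstract, the sketch below applies plain instance normalization to an intermediate feature map, removing per-sample statistics that manual guidance would otherwise inject so that the text condition drives content. It is a minimal PyTorch approximation with assumed tensor shapes, not the authors' released implementation.

```python
# Minimal sketch (assumed shapes, not the authors' code): instance normalization
# over an intermediate diffusion feature map. Normalizing each channel per sample
# strips guidance-dependent statistics, leaving the text prompt to steer content.
import torch

def instance_normalize(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize a (B, C, H, W) feature map per sample and per channel."""
    mean = feat.mean(dim=(2, 3), keepdim=True)
    var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (feat - mean) / torch.sqrt(var + eps)

# Example with a dummy activation standing in for a U-Net feature map.
feat = torch.randn(2, 320, 64, 64)
out = instance_normalize(feat)
print(out.mean().item(), out.std().item())  # approximately 0 and 1
```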
Related papers
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible means of generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging, particularly with respect to efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [42.10603331311837]
Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z) - Interactive Face Video Coding: A Generative Compression Framework [18.26476468644723]
We propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals.
The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression and headpose animation.
arXiv Detail & Related papers (2023-02-20T11:24:23Z) - KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework; a rough illustrative sketch of such a modality-transition step appears after this list.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
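As a hedged illustration of the Modality Transition Module summarized in the last entry, the sketch below projects visual features into a language embedding space and aligns their pooled representation with a target sentence embedding. The module layout, dimensions, and the cosine-based "modality loss" are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch only: a simple modality-transition module that maps visual
# features into a language embedding space, trained with an assumed cosine-based
# modality loss against a target sentence embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    def __init__(self, visual_dim: int = 2048, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, visual_dim) region/grid features -> (B, N, text_dim)
        return self.proj(visual_feats)

def modality_loss(projected: torch.Tensor, sentence_emb: torch.Tensor) -> torch.Tensor:
    # Align the mean-pooled projected features with the target sentence embedding.
    pooled = projected.mean(dim=1)
    return (1.0 - F.cosine_similarity(pooled, sentence_emb, dim=-1)).mean()

# Dummy usage with assumed dimensions.
mtm = ModalityTransition()
vis = torch.randn(4, 36, 2048)   # e.g. 36 region features per image
target = torch.randn(4, 512)     # e.g. a ground-truth caption embedding
loss = modality_loss(mtm(vis), target)
loss.backward()
```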
This list is automatically generated from the titles and abstracts of the papers on this site.