DECap: Towards Generalized Explicit Caption Editing via Diffusion
Mechanism
- URL: http://arxiv.org/abs/2311.14920v2
- Date: Wed, 6 Mar 2024 11:03:01 GMT
- Title: DECap: Towards Generalized Explicit Caption Editing via Diffusion
Mechanism
- Authors: Zhen Wang, Xinyun Jiang, Jun Xiao, Tao Chen, Long Chen
- Abstract summary: We propose a Diffusion-based Explicit Caption editing method: DECap.
We reformulate the ECE task as a denoising process under the diffusion mechanism.
The denoising process involves the explicit prediction of edit operations and corresponding content words.
- Score: 17.03837136771538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explicit Caption Editing (ECE) -- refining reference image captions through a
sequence of explicit edit operations (e.g., KEEP, DELETE) -- has attracted
significant attention due to its explainable and human-like nature. After
training with carefully designed reference and ground-truth caption pairs,
state-of-the-art ECE models exhibit limited generalization ability beyond the
original training data distribution, i.e., they are tailored to refine content
details only in in-domain samples but fail to correct errors in out-of-domain
samples. To this end, we propose a new Diffusion-based Explicit Caption editing
method: DECap. Specifically, we reformulate the ECE task as a denoising process
under the diffusion mechanism, and introduce innovative edit-based noising and
denoising processes. Thanks to this design, the noising process eliminates the need for
meticulous paired-data selection by directly introducing word-level noise during training,
letting the model learn a diverse distribution over input reference captions. The denoising
process involves the explicit prediction of
edit operations and corresponding content words, refining reference captions
through iterative step-wise editing. To implement the diffusion process more efficiently and
improve inference speed, DECap discards the prevalent
multi-stage design and directly generates edit operations and content words
simultaneously. Extensive ablations have demonstrated the strong generalization
ability of DECap in various scenarios. More interestingly, it even shows great
potential in improving the quality and controllability of caption generation.
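To make the edit-based formulation more concrete, below is a minimal, hypothetical Python sketch of the two processes the abstract describes: word-level noising of a ground-truth caption (for training) and a denoising step that predicts edit operations and content words together in a single pass, applied iteratively at inference. All names (corrupt_caption, denoise_step, predict_fn, the toy vocabulary) are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of edit-based noising/denoising in the spirit of DECap.
# Names and the toy vocabulary are illustrative assumptions, not the paper's code.
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

VOCAB = ["a", "dog", "cat", "runs", "sits", "on", "the", "grass", "sofa"]

def corrupt_caption(gt_words: List[str], noise_ratio: float) -> List[str]:
    """Edit-based noising: inject word-level noise into a ground-truth caption.

    Words may be dropped, swapped for random vocabulary words, or spuriously
    inserted, so training needs no hand-picked reference/ground-truth pairs."""
    noisy = []
    for w in gt_words:
        r = random.random()
        if r < noise_ratio / 2:
            pass                                   # word dropped (DELETE-style noise)
        elif r < noise_ratio:
            noisy.append(random.choice(VOCAB))     # word swapped (REPLACE-style noise)
        else:
            noisy.append(w)                        # word kept
        if random.random() < noise_ratio / 4:
            noisy.append(random.choice(VOCAB))     # spurious word (INSERT-style noise)
    return noisy

@dataclass
class EditPrediction:
    ops: List[str]               # one of KEEP / DELETE / REPLACE / INSERT per word
    words: List[Optional[str]]   # content word for REPLACE / INSERT, else None

def denoise_step(ref_words: List[str],
                 predict_fn: Callable[[List[str]], EditPrediction]) -> List[str]:
    """One denoising step: edit operations and content words are predicted together
    in a single pass and applied to refine the reference caption."""
    pred = predict_fn(ref_words)
    out = []
    for w, op, c in zip(ref_words, pred.ops, pred.words):
        if op == "KEEP":
            out.append(w)
        elif op == "REPLACE":
            out.append(c)
        elif op == "INSERT":
            out.extend([w, c])
        # DELETE: drop the word
    return out

def edit_caption(ref_words: List[str], predict_fn, num_steps: int = 5) -> List[str]:
    """Iterative inference: refine a reference caption over a fixed number of steps."""
    for _ in range(num_steps):
        ref_words = denoise_step(ref_words, predict_fn)
    return ref_words
```

In this reading, training would pair the corrupted caption with the edit operations that recover the ground truth, and inference would apply the denoising step for a fixed number of diffusion steps; starting from heavily noised or random words would correspond to the caption-generation setting the abstract mentions.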
Related papers
- Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing [26.02149948089938]
Instruction Influence Disentanglement (IID) is a novel framework enabling parallel execution of multiple instructions in a single denoising process.
We analyze self-attention mechanisms in DiTs and derive instruction-specific attention masks to disentangle each instruction's influence.
IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines.
arXiv Detail & Related papers (2025-04-07T07:26:25Z) - SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing [5.123822132804602]
We introduce a skeleton-aware latent diffusion (SALAD) model that captures the intricate inter-relationships between joints, frames, and words.
By leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing.
Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality.
arXiv Detail & Related papers (2025-03-18T02:20:11Z) - Lost in Edits? A λ-Compass for AIGC Provenance [119.95562081325552]
We propose a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones.
LambdaTracer is effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix or performed manually with editing software such as Adobe Photoshop.
arXiv Detail & Related papers (2025-02-05T06:24:25Z) - UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC).
CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z) - Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing [42.45138713525929]
Effective editing requires inverting the source image into a latent space, a process often hindered by prediction errors inherent in DDIM inversion.
We introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing.
Our approach requires no additional retraining and is compatible with various existing editing methods.
arXiv Detail & Related papers (2024-10-24T14:07:02Z) - DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models [79.0135981840682]
We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models.
By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data.
Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
arXiv Detail & Related papers (2024-10-10T17:59:48Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks - the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
arXiv Detail & Related papers (2024-06-07T12:33:59Z) - Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent approaches.
arXiv Detail & Related papers (2024-01-18T05:50:09Z) - High-Fidelity Diffusion-based Image Editing [19.85446433564999]
The editing performance of diffusion models tends to remain unsatisfactory even with increasing denoising steps.
We propose an innovative framework where a Markov module is incorporated to modulate diffusion model weights with residual features.
We introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score-matching.
arXiv Detail & Related papers (2023-12-25T12:12:36Z) - Tuning-Free Inversion-Enhanced Control for Consistent Image Editing [44.311286151669464]
We present a novel approach called Tuning-free Inversion-enhanced Control (TIC).
TIC correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction.
We also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes.
arXiv Detail & Related papers (2023-12-22T11:13:22Z) - BARET: Balanced Attention based Real image Editing driven by
Target-text Inversion [36.59406959595952]
We propose a novel editing technique that only requires an input image and target text for various editing types including non-rigid edits without fine-tuning diffusion model.
Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) fine-tunes the input target-text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence; (II) Progressive Transition Scheme applies a progressive linear transition between the target-text embedding and its fine-tuned version to generate a transition embedding that maintains non-rigid editing capability; (III) Balanced Attention Module (BAM) balances the trade-off between textual description and image semantics.
arXiv Detail & Related papers (2023-12-09T07:18:23Z)