CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
- URL: http://arxiv.org/abs/2512.13276v1
- Date: Mon, 15 Dec 2025 12:36:50 GMT
- Title: CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
- Authors: Yan Li, Lin Liu, Xiaopeng Zhang, Wei Xue, Wenhan Luo, Yike Guo, Qi Tian,
- Abstract summary: We propose a unified framework, CogniEdit, combining multi-modal reasoning with dense reward optimization. Our method achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
- Score: 88.9067184995168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
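The difference between sparse (single-step) and dense (trajectory-level) gradient feedback can be illustrated with a toy scalar "denoiser". Everything below is a hypothetical sketch for intuition only, not the paper's model or reward: a linear denoising chain with a terminal reward, where the dense gradient differentiates through all consecutive steps while the sparse one treats all but the last step as constant.

```python
def rollout(w, x_T, steps, alpha=0.1):
    # toy linear "denoiser": each step multiplies the state by (1 - alpha * w)
    x = x_T
    for _ in range(steps):
        x = (1 - alpha * w) * x
    return x

def reward(x0):
    # toy terminal reward: pull the final sample toward 0
    return -x0 ** 2

def grad_dense(w, x_T, steps, alpha=0.1):
    # trajectory-level gradient: differentiate the reward through ALL steps
    a = 1 - alpha * w
    x0 = x_T * a ** steps
    # dR/dw = dR/dx0 * dx0/dw, with dx0/dw = x_T * steps * a**(steps-1) * (-alpha)
    return -2 * x0 * x_T * steps * a ** (steps - 1) * (-alpha)

def grad_sparse(w, x_T, steps, alpha=0.1):
    # single-step gradient: treat the input to the last step as a constant
    a = 1 - alpha * w
    x1 = x_T * a ** (steps - 1)   # "detached" trajectory prefix
    x0 = a * x1
    return -2 * x0 * x1 * (-alpha)

w, x_T, steps = 0.5, 2.0, 10
gd, gs = grad_dense(w, x_T, steps), grad_sparse(w, x_T, steps)
print(gd / gs)  # dense credit assignment aggregates across the whole chain
```

In this linear toy the dense gradient is exactly `steps` times the sparse one, which is one way to see why per-step feedback alone gives weaker trajectory-level control.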
Related papers
- GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction [35.30036388020098]
We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
arXiv Detail & Related papers (2026-03-05T06:02:50Z) - Instance-Guided Class Activation Mapping for Weakly Supervised Semantic Segmentation [5.539128209356213]
We propose IG-CAM, a novel approach to generate high-quality, boundary-aware localization maps. Our approach demonstrates superior localization accuracy, with complete object coverage and precise boundary delineation. The results establish IG-CAM as a new benchmark for weakly supervised semantic segmentation.
arXiv Detail & Related papers (2025-09-15T22:41:44Z) - Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting [6.336372495476242]
We propose a comprehensive optimization framework integrating multisample anti-aliasing with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. Our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities.
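Quadruple-subsample blending can be sketched in miniature. The scene function, subsample offsets, and pixel layout below are illustrative assumptions (a standard rotated-grid 4x pattern), not the paper's renderer: each pixel averages four taps instead of one center tap, which softens hard edges.

```python
def scene(x, y):
    # hypothetical scene: value 1.0 left of a sharp vertical edge at x = 0.5
    return 1.0 if x < 0.5 else 0.0

# rotated-grid 2x2 subsample offsets within a unit pixel (a common 4x MSAA pattern)
OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def shade_pixel(px, py, size=1.0):
    # average the scene over four subsamples instead of a single center sample
    samples = [scene(px + ox * size, py + oy * size) for ox, oy in OFFSETS]
    return sum(samples) / len(samples)
```

A pixel straddling the edge blends to an intermediate gray instead of snapping to 0 or 1, which is the anti-aliasing effect in its simplest form.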
arXiv Detail & Related papers (2025-08-14T10:14:36Z) - Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs [74.74767980885758]
We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework. CcDPO enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains.
arXiv Detail & Related papers (2025-05-28T14:24:02Z) - Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition [4.192370959537781]
We present a semi-supervised fine-tuning approach designed to improve the performance of pre-trained foundation models on downstream tasks with limited labeled data.
We evaluate our approach on multiple datasets, including MNIST, its augmented variations, CIFAR-10, SVHN, and GalaxyMNIST.
arXiv Detail & Related papers (2024-10-02T22:36:12Z) - LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
In this paper, we show that gradient-based optimization and high-level LLM-driven optimization are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework - Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Bridging CLIP and StyleGAN through Latent Alignment for Image Editing [33.86698044813281]
We bridge CLIP and StyleGAN to achieve inference-time optimization-free diverse manipulation direction mining.
With this mapping scheme, we can achieve GAN inversion, text-to-image generation and text-driven image manipulation.
arXiv Detail & Related papers (2022-10-10T09:17:35Z) - An Adaptive Framework for Learning Unsupervised Depth Completion [59.17364202590475]
We present a method to infer a dense depth map from a color image and associated sparse depth measurements.
We show that regularization and co-visibility are related via the fitness of the model to data and can be unified into a single framework.
arXiv Detail & Related papers (2021-06-06T02:27:55Z) - Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions [36.82512331179322]
Recent research suggests that network components dealing with different modalities may overfit and generalize at different speeds, creating difficulties for training.
We propose layer-wise adaptive rate scaling (LARS) to align the magnitudes of gradient updates in different layers and balance the pace of learning.
We also use sequence-wise batch normalization (SBN) to align the internal feature distributions from different modalities.
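The layer-wise rate scaling idea behind LARS can be sketched with plain Python lists standing in for parameter tensors. The hyperparameters and list-based layout here are illustrative assumptions, not the paper's training setup: each layer's step is scaled by the ratio of its weight norm to its gradient norm, so layers with very different gradient magnitudes advance at a similar relative pace.

```python
import math

def lars_update(weights, grads, lr=0.01, trust=0.001, eps=1e-9):
    # scale each layer's step by ||w|| / ||g|| to balance the pace of learning
    new_weights = []
    for w, g in zip(weights, grads):
        w_norm = math.sqrt(sum(x * x for x in w))
        g_norm = math.sqrt(sum(x * x for x in g))
        # layer-local learning rate: large gradients get damped, small ones boosted
        local_lr = trust * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        step = lr * local_lr
        new_weights.append([x - step * gx for x, gx in zip(w, g)])
    return new_weights
```

With this scaling, the magnitude of each layer's update is roughly `lr * trust * ||w||`, independent of how large that layer's raw gradient happens to be.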
arXiv Detail & Related papers (2020-11-15T13:04:25Z) - Domain Adaptive Person Re-Identification via Coupling Optimization [58.567492812339566]
Domain adaptive person Re-Identification (ReID) is challenging owing to the domain gap and shortage of annotations on target scenarios.
This paper proposes a coupling optimization method including the Domain-Invariant Mapping (DIM) method and the Global-Local distance Optimization (GLO).
GLO is designed to train the ReID model in an unsupervised setting on the target domain.
arXiv Detail & Related papers (2020-11-06T14:01:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.