TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing
- URL: http://arxiv.org/abs/2511.13399v1
- Date: Mon, 17 Nov 2025 14:15:03 GMT
- Title: TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing
- Authors: Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen, Taiping Yao, Shouhong Ding, Jianguo Zhang,
- Abstract summary: Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency. We propose TripleFDS, a novel framework for STE with disentangled modular attributes. TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks.
- Score: 56.73004765030206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
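The abstract names two disentanglement regularizers: inter-group contrastive regularization and intra-sample multi-feature orthogonality. The following is a minimal, illustrative PyTorch sketch of how such terms could be combined; it is not the authors' released code, and the loss forms, tensor shapes, temperature, and stand-in features are all assumptions.

```python
# Illustrative sketch (assumed, not from the TripleFDS repository) of an
# inter-group contrastive loss plus an intra-sample orthogonality penalty.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style term: pull together features of the attribute shared
    within an SCB Group, push apart features from other groups.
    anchor/positive: (B, D); negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(-1, keepdim=True) / tau        # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives) / tau   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)             # positive sits at index 0
    return F.cross_entropy(logits, labels)

def orthogonality_loss(style, content, background):
    """Penalize pairwise cosine similarity between one sample's style,
    content, and background features to reduce redundancy."""
    feats = [F.normalize(f, dim=-1) for f in (style, content, background)]
    loss = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            loss = loss + (feats[i] * feats[j]).sum(-1).abs().mean()
    return loss

# Toy usage with random tensors standing in for encoder outputs.
B, K, D = 4, 8, 256
style, content, background = (torch.randn(B, D) for _ in range(3))
total = (contrastive_loss(style, torch.randn(B, D), torch.randn(B, K, D))
         + orthogonality_loss(style, content, background))
print(total.item())
```

In this reading, the contrastive term encourages agreement between features of the same attribute drawn from the same SCB Group, while the orthogonality term discourages a single image's three features from encoding overlapping information.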
Related papers
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives [7.243114047801061]
We propose a zero-shot pipeline that produces temporally coherent, identity-preserving image sequences. StoryTailor delivers expressive interactions and evolving yet stable scenes.
arXiv Detail & Related papers (2026-02-24T16:07:02Z) - ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation [14.341691123354195]
ASemConsist enables explicit semantic control over character identity without sacrificing prompt alignment. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs.
arXiv Detail & Related papers (2025-12-29T07:06:57Z) - From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval [30.33315985826623]
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. We propose a two-stage framework where the training is accomplished from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics.
arXiv Detail & Related papers (2025-04-25T00:18:23Z) - Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval [60.20835288280572]
We propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization.
arXiv Detail & Related papers (2025-03-25T02:51:25Z) - Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm, employing a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely a Textual Supplement (TS) module and a Semantic-Set (S-Set).
arXiv Detail & Related papers (2025-03-07T07:49:31Z) - PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval [35.19203010854668]
Zero-Shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and a text prompt without requiring specialized text-image composition networks trained on large-scale paired data. We introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings.
arXiv Detail & Related papers (2025-02-11T03:20:21Z) - Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. We propose a lightweight post-hoc framework consisting of two components. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements.
arXiv Detail & Related papers (2024-10-31T08:49:05Z) - No Re-Train, More Gain: Upgrading Backbones with Diffusion model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation [22.263029309151467]
Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Current FSS methods face issues such as the inflexibility of backbone upgrades without re-training and the inability to uniformly handle various types of annotations. We propose DiffUp, a novel framework that conceptualizes the FSS task as a conditional generative problem using a diffusion process.
arXiv Detail & Related papers (2024-07-23T05:09:07Z) - Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.