Related papers: CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities

CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities

URL: http://arxiv.org/abs/2508.14405v1
Date: Wed, 20 Aug 2025 04:03:54 GMT
Title: CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities
Authors: Yue Gong, Shanyuan Liu, Liuzhuozheng Li, Jian Zhu, Bo Cheng, Liebucha Wu, Xiaoyu Wu, Yuhang Ma, Dawei Leng, Yuhui Yin,
Abstract summary: An adaptation method fits the Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model.<n>We introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities.
Score: 14.855163689517276
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We proposed the Chinese Text Adapter-Flux (CTA-Flux). An adaptation method fits the Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on the English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model's understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet. Empirical evaluations demonstrate that CTA-flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.

Related papers

Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning [52.92837273570818]
Chinese characters exhibit unique structures and compositional rules, allowing for the use of fine-grained semantic information in representation.<n>We propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm.<n>Our proposed Hi-GITA outperforms existing zero-shot CCR methods.
arXiv Detail & Related papers (2025-05-30T17:39:14Z)
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model [69.09404597939744]
Seedream 2.0 is a native Chinese-English bilingual image generation foundation model.<n>It adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering.<n>It is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data.
arXiv Detail & Related papers (2025-03-10T17:58:33Z)
A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
We propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment. Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context.
arXiv Detail & Related papers (2024-04-17T10:56:06Z)
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support [35.17427411750043]
We present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model. We extend the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval.
arXiv Detail & Related papers (2024-01-26T07:17:50Z)
PAI-Diffusion: Constructing and Serving a Family of Open Chinese Diffusion Models for Text-to-image Synthesis on the Cloud [54.046884854230555]
This paper introduces PAI-Diffusion, a comprehensive framework for Chinese text-to-image synthesis. It incorporates both general and domain-specific Chinese diffusion models, enabling the generation of contextually relevant images. It seamlessly integrates with Alibaba Cloud's Machine Learning Platform for AI, providing accessible and scalable solutions.
arXiv Detail & Related papers (2023-09-11T15:18:28Z)
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots [80.32906566894171]
We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese. IAP establishes connections of Chinese, English and visual semantics in CLIP's embedding space efficiently. Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%10% training data.
arXiv Detail & Related papers (2023-05-19T09:20:27Z)
GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language. Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
Visually Grounded Reasoning across Languages and Cultures [27.31020761908739]
We develop a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. We focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. We create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images.
arXiv Detail & Related papers (2021-09-28T16:51:38Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM) Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.