Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with
  Images as Pivots
- URL: http://arxiv.org/abs/2305.11540v1
- Date: Fri, 19 May 2023 09:20:27 GMT
- Title: Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with
  Images as Pivots
- Authors: Jinyi Hu, Xu Han, Xiaoyuan Yi, Yutong Chen, Wenhao Li, Zhiyuan Liu,
  Maosong Sun
- Abstract summary: We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese.
IAP establishes connections of Chinese, English and visual semantics in CLIP's embedding space efficiently.
 Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%~10% training data.
- Score: 80.32906566894171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Diffusion models have made impressive progress in text-to-image synthesis.
However, training such large-scale models (e.g., Stable Diffusion) from scratch
requires high computational costs and massive high-quality text-image pairs,
which becomes unaffordable in other languages. To handle this challenge, we
propose IAP, a simple but effective method to transfer English Stable Diffusion
into Chinese. IAP optimizes only a separate Chinese text encoder, keeping all other
parameters fixed, to align the Chinese semantic space with the English one in CLIP.
To achieve this, we innovatively treat images as pivots and minimize the
distance of attentive features produced from cross-attention between images and
each language respectively. In this way, IAP establishes connections of
Chinese, English and visual semantics in CLIP's embedding space efficiently,
advancing the quality of the generated image with direct Chinese prompts.
Experimental results show that our method outperforms several strong Chinese
diffusion models with only 5%~10% training data.
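
The image-as-pivot objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, uses image features as queries and text token features as keys and values, and takes mean squared error as the distance. The function names `cross_attention` and `pivot_loss` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_feats):
    """Attend from image features (queries) to text token features (keys/values).

    Assumption: no learned query/key/value projections, only scaled dot products.
    Returns one attentive feature vector per image feature vector.
    """
    d = len(image_feats[0])
    out = []
    for q in image_feats:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in text_feats
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, text_feats)) for j in range(d)]
        )
    return out


def pivot_loss(image_feats, zh_feats, en_feats):
    """Mean squared distance between Chinese and English attentive features.

    Assumption: MSE stands in for whatever distance the paper actually uses.
    In training, only the encoder producing zh_feats would receive gradients.
    """
    zh_att = cross_attention(image_feats, zh_feats)
    en_att = cross_attention(image_feats, en_feats)
    sq = [(a - b) ** 2 for ra, rb in zip(zh_att, en_att) for a, b in zip(ra, rb)]
    return sum(sq) / len(sq)
```

Because the image features act as queries, both attentive outputs have one vector per image feature, so the Chinese and English prompts can be compared even when they tokenize to different lengths.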
 
      
Related papers
- Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model [69.09404597939744]
 Seedream 2.0 is a native Chinese-English bilingual image generation foundation model.
It adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering.
It is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data.
 (2025-03-10T17:58:33Z)
- LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation [14.877355149519198]
 We introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models.
Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information.
 (2025-02-25T15:42:34Z)
- Dynamic data sampler for cross-language transfer learning in large language models [34.464472766868106]
 ChatFlow is a cross-language transfer-based large language model (LLM).
We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model.
 Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance.
 (2024-05-17T08:40:51Z)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding [57.22231959529641]
 Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese.
For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images.
 (2024-05-14T16:33:25Z)
- A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
 We propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context.
In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment.
 Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context.
 (2024-04-17T10:56:06Z)
- Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support [35.17427411750043]
 We present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model.
We extend the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training.
Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval.
 (2024-01-26T07:17:50Z)
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
 Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve the zero-shot cross-lingual transfer.
 (2023-09-19T19:30:56Z)
- PAI-Diffusion: Constructing and Serving a Family of Open Chinese
  Diffusion Models for Text-to-image Synthesis on the Cloud [54.046884854230555]
 This paper introduces PAI-Diffusion, a comprehensive framework for Chinese text-to-image synthesis.
It incorporates both general and domain-specific Chinese diffusion models, enabling the generation of contextually relevant images.
It seamlessly integrates with Alibaba Cloud's Machine Learning Platform for AI, providing accessible and scalable solutions.
 (2023-09-11T15:18:28Z)
- Parameter-Efficient Cross-lingual Transfer of Vision and Language Models
  via Translation-based Alignment [31.885608173448368]
 Pre-trained vision and language models such as CLIP have witnessed remarkable success in connecting images and texts with a primary focus on English texts.
 Disparities in performance among different languages have been observed due to uneven resource availability.
We propose a new parameter-efficient cross-lingual transfer learning framework that utilizes a translation-based alignment method to mitigate multilingual disparities.
 (2023-05-02T14:09:02Z)
- Shifted Diffusion for Text-to-image Generation [65.53758187995744]
 Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text.
 Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
 (2022-11-24T03:25:04Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language
  Understanding [53.170767750244366]
 Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
 (2022-05-23T17:42:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     