AltDiffusion: A Multilingual Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2308.09991v2
- Date: Wed, 23 Aug 2023 05:19:03 GMT
- Title: AltDiffusion: A Multilingual Text-to-Image Diffusion Model
- Authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
- Abstract summary: We present AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages.
Specifically, we first train a multilingual text encoder via knowledge distillation.
Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage scheme to enhance its multilingual capability.
- Score: 4.534546889526814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Text-to-Image(T2I) diffusion models have shown a remarkable capability
to produce photorealistic and diverse images based on text inputs. However,
existing works support only a limited set of input languages, e.g., English,
Chinese, and Japanese, leaving users of other languages underserved and blocking the
global expansion of T2I models. Therefore, this paper presents AltDiffusion, a
novel multilingual T2I diffusion model that supports eighteen different
languages. Specifically, we first train a multilingual text encoder via
knowledge distillation. We then plug it into a pretrained English-only
diffusion model and train the model with a two-stage scheme to enhance its
multilingual capability: a concept-alignment stage followed by a
quality-improvement stage, both on a large-scale multilingual dataset.
Furthermore, we introduce a new
benchmark, which includes Multilingual-General-18(MG-18) and
Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I
diffusion models for generating high-quality images and capturing
culture-specific concepts in different languages. Experimental results on both
MG-18 and MC-18 demonstrate that AltDiffusion outperforms current
state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual
understanding, especially with respect to culture-specific concepts, while
retaining comparable capability for generating high-quality images. All
source code and checkpoints can be found at
https://github.com/superhero-7/AltDiffuson.
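The text-encoder distillation step can be pictured with a minimal PyTorch sketch: a frozen English CLIP teacher produces target embeddings for English captions, and a multilingual student is trained on parallel translations to reproduce them. The checkpoint names, the XLM-R student, the linear projection, and the plain MSE objective below are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of parallel-text distillation for a multilingual text encoder.
# Model choices and the MSE objective are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import (CLIPTextModel, CLIPTokenizer,
                          XLMRobertaModel, XLMRobertaTokenizer)

teacher = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
student = XLMRobertaModel.from_pretrained("xlm-roberta-large")
student_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

# Project the student's hidden size onto the teacher's embedding space.
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(english_caption: str, translated_caption: str) -> torch.Tensor:
    """One distillation step on a parallel (English, translated) caption pair."""
    with torch.no_grad():
        t_in = teacher_tok(english_caption, return_tensors="pt",
                           padding="max_length", truncation=True, max_length=77)
        t_emb = teacher(**t_in).pooler_output          # frozen teacher target
    s_in = student_tok(translated_caption, return_tensors="pt",
                       padding="max_length", truncation=True, max_length=77)
    s_emb = proj(student(**s_in).pooler_output)        # student prediction
    loss = nn.functional.mse_loss(s_emb, t_emb)        # align embedding spaces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

After distillation, the aligned multilingual encoder replaces the original English-only text encoder of the pretrained diffusion model, which is then fine-tuned with the two-stage scheme described in the abstract.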
Related papers
- Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support [35.17427411750043]
We present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model.
We extend the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training.
Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval.
arXiv Detail & Related papers (2024-01-26T07:17:50Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases.
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded in a well-trained teacher text model into a student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)