Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
- URL: http://arxiv.org/abs/2405.08748v1
- Date: Tue, 14 May 2024 16:33:25 GMT
- Title: Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
- Authors: Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu
- Abstract summary: Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese.
For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images.
- Score: 57.22231959529641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT
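For reference, the released checkpoints also ship with a Hugging Face diffusers integration. Below is a minimal usage sketch, assuming the HunyuanDiTPipeline class and the Tencent-Hunyuan/HunyuanDiT-Diffusers model id advertised in the project repository; both are assumptions about the current release and may differ from your installed version.

```python
# Minimal text-to-image sketch via diffusers (assumes diffusers >= 0.29 with
# HunyuanDiTPipeline; the model id below is taken from the project README and
# may change between releases).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")

# Hunyuan-DiT accepts both English and Chinese prompts.
prompt = "一个骑马的宇航员"  # "an astronaut riding a horse"
image = pipe(prompt=prompt).images[0]
image.save("astronaut.png")
```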
Related papers
- AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task Translate AnyText in the Image (TATI).
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text-image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z)
- Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support [35.17427411750043]
We present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model.
We extend the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training.
Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval.
arXiv Detail & Related papers (2024-01-26T07:17:50Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots [80.32906566894171]
We propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese.
IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space; a toy sketch of this pivoting idea appears after this list.
Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%-10% of the training data.
arXiv Detail & Related papers (2023-05-19T09:20:27Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Visually Grounded Reasoning across Languages and Cultures [27.31020761908739]
We develop a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures.
We focus on a typologically diverse set of languages, namely Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish.
We create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native-speaker annotators about pairs of images.
arXiv Detail & Related papers (2021-09-28T16:51:38Z)
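To make the "images as pivots" idea above concrete, here is an illustrative sketch; this is not the IAP authors' code, and the loss form and function names are assumptions. The intuition: CLIP already aligns English captions with images, so pulling a trainable Chinese text encoder toward frozen CLIP image embeddings of paired images places Chinese, English, and visual semantics in one shared space.

```python
# Illustrative sketch of image-as-pivot alignment (assumed formulation, not the
# IAP paper's actual implementation): pull each Chinese caption embedding
# toward the frozen CLIP image embedding of its paired image.
import torch
import torch.nn.functional as F

def pivot_alignment_loss(zh_text_emb: torch.Tensor,
                         clip_image_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine distance between Chinese text embeddings (from a trainable
    encoder) and CLIP image embeddings (from a frozen encoder), both of shape
    (batch, dim)."""
    z_t = F.normalize(zh_text_emb, dim=-1)
    z_i = F.normalize(clip_image_emb, dim=-1)
    return (1.0 - (z_t * z_i).sum(dim=-1)).mean()
```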
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.