Related papers: Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models

Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models

URL: http://arxiv.org/abs/2501.07086v1
Date: Mon, 13 Jan 2025 06:41:23 GMT
Title: Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models
Authors: Yongyu Mu, Hengyu Li, Junxin Wang, Xiaoxuan Zhou, Chenglong Wang, Yingfeng Luo, Qiaozhi He, Tong Xiao, Guocheng Chen, Jingbo Zhu,
Abstract summary: We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs)<n> Experiments on two LMMs across 3 benchmarks show that our method, PMT2I achieves, superior performance in general, compositional, and fine-grained assessments.
Score: 43.16111789538798
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at https://github.com/takagi97/PMT2I.

Related papers

uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data [3.364569898365253]
We propose a lightweight and data-efficient framework for multilingual vision-language alignment.<n>Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training.<n>This minimal training setup enables robust multilingual alignment even for languages with limited supervision.
arXiv Detail & Related papers (2025-11-17T06:34:49Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images [5.587329786636647]
Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space.<n>CLIP models often struggle with text-only tasks, underperforming compared to specialized text models.<n>In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages.<n>The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains.
arXiv Detail & Related papers (2024-12-11T22:28:12Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts. Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation. We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs. It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.