Multi-Agent Multimodal Models for Multicultural Text to Image Generation
- URL: http://arxiv.org/abs/2502.15972v1
- Date: Fri, 21 Feb 2025 22:19:56 GMT
- Title: Multi-Agent Multimodal Models for Multicultural Text to Image Generation
- Authors: Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat
- Abstract summary: We evaluate the performance of Large Language Models (LLMs) in a multi-agent interaction setting for the novel task of multicultural image generation. We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics.
- Score: 4.934817254755008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
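The multi-agent persona idea can be sketched as a simple prompt-refinement loop. This is a minimal, hypothetical illustration: the persona names, the `contribute` template, and the aggregation step are assumptions for demonstration, not MosAIG's actual agents or prompts.

```python
# Minimal sketch of multi-agent, persona-driven prompt refinement in the
# spirit of MosAIG. A real system would call an LLM inside contribute();
# here a template stands in so the control flow runs end to end.

from dataclasses import dataclass


@dataclass
class PersonaAgent:
    """An agent responsible for one cultural aspect of the prompt."""
    name: str
    focus: str  # the aspect this persona enriches (hypothetical roles)

    def contribute(self, base_prompt: str) -> str:
        # Placeholder for an LLM call conditioned on this persona.
        return f"{self.focus} details for: {base_prompt}"


def build_image_prompt(base_prompt: str, agents: list[PersonaAgent]) -> str:
    """Aggregate every persona's contribution into one generation prompt."""
    contributions = [agent.contribute(base_prompt) for agent in agents]
    return base_prompt + " | " + "; ".join(contributions)


agents = [
    PersonaAgent("social", "attire and social context"),
    PersonaAgent("landmark", "architectural landmark"),
]
print(build_image_prompt("a family visiting a historical landmark", agents))
```

The aggregated prompt would then be passed to a text-to-image model; the division into persona roles is what distinguishes this from a single, monolithic prompt.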
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the scarcity of labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
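The "unified representation space" idea behind such embedding models can be illustrated with a toy example: text and images are encoded into the same vector space, and relevance is scored by cosine similarity. The embeddings below are hand-picked stand-ins, not the output of mmE5 or any real encoder.

```python
# Toy illustration of scoring text-image relevance in a shared embedding
# space. In practice the vectors come from trained text/image encoders.

import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = aligned, <0 = opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Pretend embeddings (illustrative values only).
text_emb = [0.9, 0.1, 0.4]           # e.g. an encoded caption
matching_image = [0.8, 0.2, 0.5]     # image of the described scene
unrelated_image = [-0.7, 0.9, -0.2]  # image of something else

print(cosine_similarity(text_emb, matching_image))   # high (close to 1)
print(cosine_similarity(text_emb, unrelated_image))  # low (negative here)
```

Retrieval and reranking then reduce to picking the candidates with the highest similarity to the query embedding.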
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs.
Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework.
We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks.
arXiv Detail & Related papers (2024-12-08T13:45:44Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
We introduce MosAIC, a framework to enhance cross-cultural image captioning using LMMs with distinct cultural personas.
We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania.
We show that the multi-agent interaction outperforms single-agent models across different metrics.
arXiv Detail & Related papers (2024-11-18T17:37:10Z)
- M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
M5 is the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual context.
We highlight substantial task-agnostic performance disparities between high- and low-resource languages.
We show that larger models do not necessarily outperform smaller ones in a multilingual setting.
arXiv Detail & Related papers (2024-07-04T09:55:04Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- Generative Multimodal Models are In-Context Learners
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Large Scale Multi-Lingual Multi-Modal Summarization Dataset
We present the current largest multi-lingual multi-modal summarization dataset (M3LS).
It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair.
It is also the largest summarization dataset for 13 languages and includes cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z)
- Large-scale Bilingual Language-Image Contrastive Learning
We collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP.
We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation.
Experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages.
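The image-text contrastive objective at the heart of such bilingual models can be sketched as a symmetric InfoNCE loss: matched image/text pairs are pulled together, mismatched pairs pushed apart. The 2-D toy embeddings below are illustrative assumptions; KELIP's actual encoders, temperature, and batch sizes differ.

```python
# Hedged sketch of the symmetric image-text contrastive (InfoNCE) loss
# used by CLIP-style models such as KELIP. Pair i in the batch is the
# positive; every other pairing in the batch is a negative.

import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(img: list[list[float]], txt: list[list[float]]) -> float:
    """Symmetric InfoNCE over a batch of paired embeddings."""
    n = len(img)
    # Similarity logits: dot product of every image with every text.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) for j in range(n)]
              for i in range(n)]
    loss = 0.0
    for i in range(n):
        # image -> text direction: row i, correct class i
        loss += -math.log(softmax(logits[i])[i])
        # text -> image direction: column i, correct class i
        col = [logits[j][i] for j in range(n)]
        loss += -math.log(softmax(col)[i])
    return loss / (2 * n)


aligned = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(aligned, aligned))  # lower than for shuffled pairs
```

Multi-crop augmentation, mentioned above, would feed several crops of each image through the same objective; MAE pre-training initializes the image encoder before this contrastive stage.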
arXiv Detail & Related papers (2022-03-28T03:02:03Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
We propose a novel model, namely InterBERT (BERT for Interaction), the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling the interaction between the information flows of different modalities.
We also propose a large-scale dataset for multi-modal pretraining in Chinese, and develop the Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.