Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models
- URL: http://arxiv.org/abs/2505.04650v1
- Date: Tue, 06 May 2025 18:53:34 GMT
- Title: Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models
- Authors: Kapil Wanaskar, Gaytri Jena, Magdalini Eirinaki
- Abstract summary: This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models. Our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
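The CLIP-based similarity named in the abstract is commonly computed as the cosine similarity between a text embedding and an image embedding. A minimal sketch of that final step follows, assuming the embeddings have already been produced by some encoder. The function name and the toy vectors are illustrative assumptions and are not taken from the paper's framework.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real CLIP embeddings
# (hypothetical values, chosen only to make the arithmetic easy to follow).
text_embedding = [1.0, 0.0, 1.0]
image_embedding = [1.0, 1.0, 0.0]
score = cosine_similarity(text_embedding, image_embedding)
print(f"similarity: {score:.3f}")
```

Values near 1 indicate that the prompt and the generated image point in nearly the same direction in the shared embedding space; values near 0 indicate little alignment.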
Related papers
- Evaluating Image Caption via Cycle-consistent Text-to-Image Generation [24.455344211552692]
We propose CAMScore, a reference-free automatic evaluation metric for image captioning models. To circumvent the aforementioned modality gap, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Experiment results show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics.
arXiv Detail & Related papers (2025-01-07T06:35:34Z)
- EvalGIM: A Library for Evaluating Generative Image Models [26.631349186382664]
We introduce EvalGIM, a library for evaluating text-to-image generative models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models.
arXiv Detail & Related papers (2024-12-13T23:15:35Z)
- HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE). HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Streamlined Framework for Agile Forecasting Model Development towards Efficient Inventory Management [2.0625936401496237]
This paper proposes a framework for developing forecasting models by streamlining the connections between core components of the developmental process.
The proposed framework enables swift and robust integration of new datasets, experimentation on different algorithms, and selection of the best models.
arXiv Detail & Related papers (2023-04-13T08:52:32Z)
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)