OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
- URL: http://arxiv.org/abs/2509.24900v1
- Date: Mon, 29 Sep 2025 15:11:09 GMT
- Title: OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
- Authors: Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang
- Abstract summary: We introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology. We generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
- Score: 45.539561363519844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
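As a rough illustration of the idea described in the abstract (a hierarchical task taxonomy combined with structured resource pools), the Python sketch below composes instruction prompts by sampling a domain, a subtask, and one entry from each pool. The taxonomy entries, pool contents, prompt template, and function names are all hypothetical placeholders and are not taken from the paper; the call to an image model such as GPT-4o is deliberately left out.

```python
import random

# Hypothetical two-level taxonomy: domain -> subtasks (illustrative names only).
TAXONOMY = {
    "text rendering": ["poster title", "street sign"],
    "style control": ["watercolor", "pixel art"],
}

# Hypothetical resource pools used to fill the prompt slots.
POOLS = {
    "subject": ["a lighthouse", "a chemistry flask", "a city tram"],
    "setting": ["at dusk", "in heavy fog", "on a white background"],
}


def sample_instruction(rng):
    """Pick a domain, a subtask, and one entry per pool, then fill a template."""
    domain = rng.choice(sorted(TAXONOMY))
    subtask = rng.choice(TAXONOMY[domain])
    slots = {name: rng.choice(values) for name, values in POOLS.items()}
    prompt = f"[{domain} / {subtask}] Generate {slots['subject']} {slots['setting']}."
    return {"domain": domain, "subtask": subtask, "prompt": prompt}


def build_prompts(n, seed=0):
    """Sample prompts, skipping exact duplicates as a basic diversity control."""
    rng = random.Random(seed)
    # Upper bound on distinct prompts: subtasks times the size of every pool.
    max_unique = sum(len(subtasks) for subtasks in TAXONOMY.values())
    for values in POOLS.values():
        max_unique *= len(values)
    seen, records = set(), []
    while len(records) < min(n, max_unique):
        record = sample_instruction(rng)
        if record["prompt"] not in seen:
            seen.add(record["prompt"])
            records.append(record)
    return records
```

Calling `build_prompts(10)` returns ten distinct prompt records, each tagged with its domain and subtask. In a real pipeline each prompt would then be sent to an image model and the result filtered for quality, steps that this sketch does not attempt to reproduce.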
Related papers
- OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models [57.94189874119267]
Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems.
Current graph learning-based design methodologies often adhere to a "one-for-one" paradigm.
We propose OFA-TAD, a one-for-all framework that generates adaptive collaboration graphs for any task described in natural language.
arXiv Detail & Related papers (2026-01-19T12:23:44Z)
- Training a Custom CNN on Five Heterogeneous Image Datasets [1.4583375893645076]
This study investigates the effectiveness of CNN-based architectures across five datasets spanning agricultural and urban domains.
These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance.
We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning.
arXiv Detail & Related papers (2026-01-08T08:44:17Z)
- Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing [40.13961086100904]
Pico-Banana-400K is a comprehensive 400K-image dataset for instruction-based image editing.
Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs.
By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
arXiv Detail & Related papers (2025-10-22T17:43:15Z)
- PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization [82.96200364977737]
We introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks.
PlotCraft is structured around seven high-level visualization tasks and encompasses 48 distinct chart types.
It is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities.
arXiv Detail & Related papers (2025-10-15T10:14:39Z)
- $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark [36.58090024531738]
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to evaluate instruction-based image editing models.
We harness GPT-4o to automatically collect a diverse set of editing instructions at scale.
We introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline.
arXiv Detail & Related papers (2025-04-17T17:51:59Z)
- TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types [8.755996117965571]
Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data.
Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling.
We propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples.
arXiv Detail & Related papers (2025-02-14T05:32:46Z)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA).
Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
A synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.