Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
- URL: http://arxiv.org/abs/2510.19808v1
- Date: Wed, 22 Oct 2025 17:43:15 GMT
- Title: Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
- Authors: Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan
- Abstract summary: Pico-Banana-400K is a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
- Score: 40.13961086100904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
Related papers
- OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing [45.539561363519844]
We introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology. We generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
arXiv Detail & Related papers (2025-09-29T15:11:09Z)
- MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks [46.87912659985628]
MultiEdit is a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style transfer operations. We employ a novel dataset construction pipeline that utilizes two multi-modal large language models (MLLMs) to generate visual-adaptive editing instructions.
arXiv Detail & Related papers (2025-09-18T05:33:38Z)
- Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions [20.617718631292696]
We develop a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing.
arXiv Detail & Related papers (2025-05-25T22:40:59Z)
- FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models [48.85744313139525]
We develop FragFake, the first dedicated benchmark dataset for edited image detection. We use Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. This work is the first to reformulate localized image edit detection as a vision-language understanding task.
arXiv Detail & Related papers (2025-05-21T15:22:45Z)
- HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing [93.06156989757994]
HumanEdit comprises 5,751 images and requires more than 2,500 hours of human effort across four stages. The dataset includes six distinct types of editing instructions: Action, Add, Counting, Relation, Remove, and Replace. HumanEdit offers comprehensive diversity and high-resolution $1024 \times 1024$ content sourced from various domains.
arXiv Detail & Related papers (2024-12-05T16:00:59Z)
- AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing [38.13162627140172]
HQ-Edit is a high-quality instruction-based image editing dataset with around 200,000 edits.
To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs.
HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models.
arXiv Detail & Related papers (2024-04-15T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.