ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- URL: http://arxiv.org/abs/2506.18095v1
- Date: Sun, 22 Jun 2025 16:51:09 GMT
- Title: ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
- Abstract summary: We present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data. We develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation.
- Score: 17.762312185501823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
Related papers
- Preliminary Explorations with GPT-4o(mni) Native Image Generation [7.700772640399941]
Recently, OpenAI unlocked the visual generation ability of GPT-4o(mni). In this paper, we aim to explore the capabilities of GPT-4o across various tasks.
arXiv Detail & Related papers (2025-05-06T19:35:29Z) - Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields [14.805239427360208]
AIGC foundation models are powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than compact descriptors. Recent GPT-4o image generation of OpenAI has achieved impressive cross-modality generation, editing, and design capabilities.
arXiv Detail & Related papers (2025-04-30T17:20:14Z) - An Empirical Study of GPT-4o Image Generation Capabilities [40.86026243294732]
We conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling.
arXiv Detail & Related papers (2025-04-08T12:34:36Z) - GPT-4o System Card [211.87336862081963]
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video.
It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.
It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
arXiv Detail & Related papers (2024-10-25T17:43:01Z) - Paragraph-to-Image Generation with Information-Enriched Diffusion Model [62.81033771780328]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task. It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z) - LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z) - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [41.84885546518666]
GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
arXiv Detail & Related papers (2023-04-20T18:25:35Z) - Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.