ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- URL: http://arxiv.org/abs/2506.18095v1
- Date: Sun, 22 Jun 2025 16:51:09 GMT
- Title: ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
- Abstract summary: We present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data. We develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation.
- Score: 17.762312185501823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
Related papers
- Preliminary Explorations with GPT-4o(mni) Native Image Generation [7.700772640399941]
Recently, OpenAI unlocked the visual generation ability of GPT-4o(mni). In this paper, we aim to explore the capabilities of GPT-4o across various tasks.
arXiv Detail & Related papers (2025-05-06T19:35:29Z) - Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields [14.805239427360208]
AIGC foundation models are powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than compact descriptors. Recent GPT-4o image generation of OpenAI has achieved impressive cross-modality generation, editing, and design capabilities.
arXiv Detail & Related papers (2025-04-30T17:20:14Z) - An Empirical Study of GPT-4o Image Generation Capabilities [40.86026243294732]
We conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling.
arXiv Detail & Related papers (2025-04-08T12:34:36Z) - GPT-4o System Card [211.87336862081963]
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video.
It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.
It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
arXiv Detail & Related papers (2024-10-25T17:43:01Z) - Paragraph-to-Image Generation with Information-Enriched Diffusion Model [62.81033771780328]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task. It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z) - LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z) - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [41.84885546518666]
GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
arXiv Detail & Related papers (2023-04-20T18:25:35Z) - Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.