PixelArena: A benchmark for Pixel-Precision Visual Intelligence
- URL: http://arxiv.org/abs/2512.16303v1
- Date: Thu, 18 Dec 2025 08:41:27 GMT
- Title: PixelArena: A benchmark for Pixel-Precision Visual Intelligence
- Authors: Feng Liang, Sizhe Cheng, Chenqi Yi
- Abstract summary: In PixelArena, we propose using semantic segmentation tasks to objectively examine the fine-grained generative intelligence of multi-modal models with image output, with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings.
- Score: 2.8513276675793855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
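To make the idea of scoring a generated semantic mask "with pixel precision" concrete, here is a minimal sketch of mean intersection-over-union (mIoU), a common semantic segmentation metric. The abstract does not say which metrics PixelArena uses, so the choice of mIoU is an assumption, as are the names `per_class_iou` and `mean_iou` and the representation of a mask as rows of integer class ids.

```python
from typing import Dict, Sequence

# Assumption: a mask is a 2D grid of integer class ids in [0, num_classes).
Mask = Sequence[Sequence[int]]


def per_class_iou(pred: Mask, target: Mask, num_classes: int) -> Dict[int, float]:
    """Return the IoU of every class that appears in pred or target."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("pred and target must have the same shape")

    intersection = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes

    # Count, per class, the predicted pixels, the reference pixels,
    # and the pixels where both masks agree.
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_area[p] += 1
            target_area[t] += 1
            if p == t:
                intersection[p] += 1

    ious: Dict[int, float] = {}
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:  # skip classes absent from both masks
            ious[c] = intersection[c] / union
    return ious


def mean_iou(pred: Mask, target: Mask, num_classes: int) -> float:
    """Average the per-class IoU over the classes present in either mask."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy 3x3 example with three classes (hypothetical data).
    target = [[0, 0, 1], [0, 1, 1], [2, 2, 2]]
    pred = [[0, 0, 1], [0, 0, 1], [2, 2, 1]]
    print(per_class_iou(pred, target, 3))
    print(round(mean_iou(pred, target, 3), 3))
```

A model that outputs its mask as a color image would first need each pixel color mapped to a class id. That step is omitted here because the abstract does not describe the mask encoding.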
Related papers
- Diversity over Uniformity: Rethinking Representation in Generated Image Detection [22.020742109848317]
We argue that reliable generated-image detection should not depend on a single decision path but should preserve multiple judgment perspectives. We propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings.
arXiv Detail & Related papers (2026-02-28T15:42:12Z)
- Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era [16.50510044709939]
We investigate how well large-scale models, trained on vast quantities of data, represent semantic feature norms of concrete object concepts. We evaluate image encoders trained on image data alone, as well as multimodally trained image encoders and language-only models.
arXiv Detail & Related papers (2025-06-04T14:18:35Z)
- Prefilled responses enhance zero-shot detection of AI-generated images [2.6581858762749997]
We explore pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. We evaluate VLM performance on three benchmarks encompassing synthetic images of human faces, objects, and animals. In particular, prefilling a VLM response with the task-aligned phrase "Let's examine the style and the synthesis artifacts" improves the Macro F1 scores of three widely used open-source VLMs by up to 24%.
arXiv Detail & Related papers (2025-05-20T22:44:04Z)
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding [84.87802580670579]
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations. Our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information.
arXiv Detail & Related papers (2025-04-06T09:20:49Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Zero-Shot Detection of AI-Generated Images [54.01282123570917]
We propose a zero-shot entropy-based detector (ZED) to detect AI-generated images.
Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images.
ZED achieves an average improvement of more than 3% over the SoTA in terms of accuracy.
arXiv Detail & Related papers (2024-09-24T08:46:13Z)
- JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images [72.42826916932519]
We release JourneyBench, a benchmark of generated images to assess the model's fine-grained multimodal reasoning abilities. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
arXiv Detail & Related papers (2024-09-19T17:58:16Z)
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- PatchCraft: Exploring Texture Patch for Efficient AI-generated Image Detection [39.820699370876916]
We propose a novel AI-generated image detector capable of identifying fake images created by a wide range of generative models.
A novel Smash&Reconstruction preprocessing is proposed to erase the global semantic information and enhance texture patches.
Our approach outperforms state-of-the-art baselines by a significant margin.
arXiv Detail & Related papers (2023-11-21T07:12:40Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.