GenExam: A Multidisciplinary Text-to-Image Exam
- URL: http://arxiv.org/abs/2509.14232v2
- Date: Thu, 02 Oct 2025 16:45:30 GMT
- Title: GenExam: A Multidisciplinary Text-to-Image Exam
- Authors: Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
- Abstract summary: GenExam is the first benchmark for multidisciplinary text-to-image exams. It features 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points.
- Score: 91.06661449186537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to general AGI. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
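As a rough illustration of how per-problem scoring points could be aggregated into the strict scores quoted above, here is a minimal Python sketch. The `ScoringPoint` structure and the all-or-nothing aggregation rule are assumptions made for illustration; they are not the released GenExam evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoringPoint:
    """One fine-grained criterion for a problem (hypothetical structure)."""
    description: str
    satisfied: bool  # judged, e.g., by a human grader or a VLM judge

def strict_score(points: List[ScoringPoint]) -> float:
    """All-or-nothing: the image earns credit only if every scoring point holds
    (an assumed reading of 'strict score'; the official rule may differ)."""
    return 1.0 if points and all(p.satisfied for p in points) else 0.0

def partial_score(points: List[ScoringPoint]) -> float:
    """Fraction of satisfied scoring points, as a softer alternative."""
    return sum(p.satisfied for p in points) / len(points) if points else 0.0

# Example: two of three criteria met -> strict 0.0, partial ~0.67
example = [
    ScoringPoint("axes labeled correctly", True),
    ScoringPoint("curve has the required shape", True),
    ScoringPoint("key quantity annotated", False),
]
print(strict_score(example), round(partial_score(example), 2))
```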
Related papers
- ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation [151.75112778479468]
We study professional image generation, where a model must synthesize scientifically precise illustrations from technical descriptions. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 accuracy and a 0.553 criterion score overall.
arXiv Detail & Related papers (2025-12-13T07:13:43Z)
- Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models [15.983959465314749]
We introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. The benchmark consists of 1,100 prompts across three core categories. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, showing that all of them exhibit, to varying degrees, fundamental limitations in implicit world knowledge and physical causal reasoning.
arXiv Detail & Related papers (2025-11-23T03:44:54Z)
- WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation [98.47375190901447]
We present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images.
arXiv Detail & Related papers (2025-11-14T16:02:38Z)
- MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning [20.382087716921003]
We introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG). MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. We introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between knowledge graphs (KGs), with a visual clarity assessment.
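The MMMG-Score described above combines a graph-edit-distance fidelity term with a visual-clarity term. The sketch below, built on networkx's graph_edit_distance, shows one plausible composition; the size normalization and the multiplicative combination are assumptions for illustration, not MMMG's published formula.

```python
import networkx as nx

def kg_fidelity(pred_kg: nx.DiGraph, gt_kg: nx.DiGraph) -> float:
    """Factual fidelity from graph-edit distance between knowledge graphs.
    Structure-only GED; label-aware matching would need node_match/edge_match.
    Normalization by total graph size is an assumption for illustration."""
    ged = nx.graph_edit_distance(pred_kg, gt_kg)
    denom = (pred_kg.number_of_nodes() + pred_kg.number_of_edges()
             + gt_kg.number_of_nodes() + gt_kg.number_of_edges())
    return max(0.0, 1.0 - ged / denom) if denom else 1.0

def mmmg_style_score(pred_kg: nx.DiGraph, gt_kg: nx.DiGraph, clarity: float) -> float:
    """Combine fidelity with a visual-clarity score in [0, 1].
    The multiplicative combination is assumed, not MMMG's official rule."""
    return kg_fidelity(pred_kg, gt_kg) * clarity

# Toy example: one relation missing from the predicted knowledge graph
gt = nx.DiGraph([("mitochondrion", "ATP"), ("chloroplast", "glucose")])
pred = nx.DiGraph([("mitochondrion", "ATP")])
print(round(mmmg_style_score(pred, gt, clarity=0.9), 3))
```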
arXiv Detail & Related papers (2025-06-12T17:58:09Z)
- SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model [21.81341169834812]
SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances from leading scientific papers across 13 natural and computer science disciplines. Results reveal that even top-tier models like GPT-4o-image lag behind human performance.
arXiv Detail & Related papers (2025-05-28T08:51:01Z)
- GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z)
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation [25.010417955441653]
Text-to-image (T2I) models are capable of generating high-quality artistic creations and visual content. We propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation.
arXiv Detail & Related papers (2025-03-10T12:47:53Z)
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation [103.3465421081531]
VQAScore is a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt.
Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward.
We release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt.
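Since VQAScore reduces to the probability a VQA model assigns to an affirmative answer about prompt-image alignment, a ranking loop over candidate generations is straightforward to sketch. The `yes_probability` adapter below is a hypothetical stand-in for any concrete VQA backend; the question template and the loop are illustrative, not the exact GenAI-Bench implementation.

```python
from typing import Callable, List, Tuple

def vqa_score(image_path: str, prompt: str,
              yes_probability: Callable[[str, str], float]) -> float:
    """VQAScore-style metric: probability that a VQA model answers 'yes' when
    asked whether the image depicts the prompt. `yes_probability` is a
    hypothetical adapter around whatever VQA model is available."""
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    return yes_probability(image_path, question)

def rank_candidates(images: List[str], prompt: str,
                    yes_probability: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Rank several generations of the same prompt by VQAScore, best first,
    mirroring the GenAI-Rank ranking setup described above."""
    scored = [(img, vqa_score(img, prompt, yes_probability)) for img in images]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stubbed usage example: replace the lambda with a real VQA-model adapter
stub = lambda image, question: 0.5
print(rank_candidates(["gen_a.png", "gen_b.png"], "a red cube on a blue sphere", stub))
```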
arXiv Detail & Related papers (2024-06-19T18:00:07Z)
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions, attaining 95% accuracy on the SAT Math test and 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.