Related papers: UEval: A Benchmark for Unified Multimodal Generation

URL: http://arxiv.org/abs/2601.22155v1
Date: Thu, 29 Jan 2026 18:59:52 GMT
Title: UEval: A Benchmark for Unified Multimodal Generation
Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu
Abstract summary: We introduce UEval, a benchmark to evaluate unified models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations.
Score: 27.555018737280772
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Unlike previous work that relies on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
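
The rubric-based scoring described in the abstract can be illustrated with a small sketch. The abstract does not state how per-criterion judgments are aggregated, so the rule used here (the percentage of criteria satisfied per question, averaged over questions) is an assumption, not the paper's confirmed method. The names `Question`, `score_question`, `score_benchmark`, and `keyword_judge` are hypothetical, and `keyword_judge` is a toy stand-in for the MLLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    prompt: str
    criteria: List[str]  # validated rubric criteria for this question


# A judge takes (criterion, response) and says whether the criterion is met.
Judge = Callable[[str, str], bool]


def score_question(question: Question, response: str, judge: Judge) -> float:
    """Percentage of rubric criteria the judge marks as satisfied.

    ASSUMPTION: equal weight per criterion; the abstract does not specify this.
    """
    if not question.criteria:
        raise ValueError("question has no rubric criteria")
    satisfied = sum(1 for c in question.criteria if judge(c, response))
    return 100.0 * satisfied / len(question.criteria)


def score_benchmark(questions: List[Question], responses: List[str],
                    judge: Judge) -> float:
    """Mean per-question score on a 0-100 scale (assumed aggregation)."""
    if len(questions) != len(responses):
        raise ValueError("need exactly one response per question")
    if not questions:
        raise ValueError("no questions to score")
    scores = [score_question(q, r, judge) for q, r in zip(questions, responses)]
    return sum(scores) / len(scores)


def keyword_judge(criterion: str, response: str) -> bool:
    """Toy judge: a criterion is met if its text appears in the response."""
    return criterion.lower() in response.lower()
```

With a real judge, `response` would also carry the generated images; the toy judge only inspects text so that the aggregation step stays easy to follow.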

Related papers

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning [26.133555631867385]
OneReward is a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks. We develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning.
arXiv Detail & Related papers (2025-08-28T17:59:46Z)
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts [14.966795545558474]
This paper introduces WikiMixQA, a benchmark for evaluating cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve 70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required.
arXiv Detail & Related papers (2025-06-18T16:09:18Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
The large language model (LLM)-as-judge paradigm has been used to meet the demand for cheap, reliable, and fast evaluation of model outputs. We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. Our comprehensive study reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
arXiv Detail & Related papers (2025-03-19T18:09:19Z)
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation [59.53678957969471]
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge. OpenING is a benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. IntJudge is a judge model for evaluating open-ended multimodal generation methods.
arXiv Detail & Related papers (2024-11-27T16:39:04Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
OLMES: A Standard for Language Model Evaluations [64.85905119836818]
OLMES is a documented, practical, open standard for reproducible language model evaluations. It supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions. OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
REBUS: A Robust Evaluation Benchmark of Understanding Symbols [1.90463290938268]
GPT-4o significantly outperforms all other models, and proprietary models in general outperform the remaining evaluated models. Even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
arXiv Detail & Related papers (2024-01-11T00:30:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.