Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- URL: http://arxiv.org/abs/2504.02826v4
- Date: Tue, 27 May 2025 15:54:58 GMT
- Title: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
- Abstract summary: We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
- Score: 84.16442052968615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose a robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-4o-Image, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.
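The abstract describes scoring edited images along three dimensions (Instruction Reasoning, Appearance Consistency, Visual Plausibility) with an LMM-as-a-judge. The sketch below is a minimal illustration of how such a multi-dimension judging loop could be wired up; the `judge_edit` stub, the 1-5 scale, and the all-dimensions-must-pass aggregation are illustrative assumptions, not RISEBench's actual protocol.

```python
# Hypothetical sketch of an LMM-as-a-judge scoring loop for a RISE-style
# benchmark. The three dimensions mirror those named in the abstract; the
# judge stub, score scale, and aggregation rule are assumptions for
# illustration only.

DIMENSIONS = ("instruction_reasoning", "appearance_consistency", "visual_plausibility")

def judge_edit(original_desc: str, edited_desc: str, instruction: str) -> dict:
    """Stand-in for an LMM judge call, returning a 1-5 score per dimension.

    A real pipeline would prompt a multimodal model with the original image,
    the edited image, and the instruction, then parse a structured response.
    Here we return fixed scores for demonstration.
    """
    return {dim: 5 for dim in DIMENSIONS}

def aggregate(scores: dict, threshold: float = 4.0) -> bool:
    """Count an edit as correct only if every dimension clears the threshold."""
    return all(scores[dim] >= threshold for dim in DIMENSIONS)

def benchmark_accuracy(cases: list) -> float:
    """Fraction of test cases judged correct across all dimensions."""
    passed = sum(aggregate(judge_edit(*case)) for case in cases)
    return passed / len(cases) if cases else 0.0

cases = [("a red apple on a table", "a bitten red apple",
          "show the apple after someone takes one bite")]
print(benchmark_accuracy(cases))  # 1.0 with the stub judge
```

Requiring every dimension to pass (rather than averaging) reflects the intuition that a reasoning-correct edit which destroys appearance consistency should still count as a failure.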
Related papers
- Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning [14.984593408786045]
We propose ReasonBench to evaluate the performance of visual language models (VLMs) in graphic reasoning tasks.
ReasonBench includes 1,613 questions from real-world intelligence tests.
We benchmark 11 mainstream VLMs and reveal significant limitations of current models.
arXiv Detail & Related papers (2025-08-01T05:12:38Z)
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning [15.17428354380373]
We introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition.
We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability.
arXiv Detail & Related papers (2025-07-09T21:44:12Z)
- VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories? [34.7828249918764]
We introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities.
VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task.
arXiv Detail & Related papers (2025-06-13T08:27:45Z)
- What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between the original and the edited image.
It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision.
We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation to human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z)
- KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models [88.58758610679762]
We introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens.
We categorize editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural.
To support fine-grained evaluation, we propose a protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies.
arXiv Detail & Related papers (2025-05-22T14:08:59Z)
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies.
VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning.
We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy.
arXiv Detail & Related papers (2025-04-24T06:16:38Z)
- V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs.
We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs.
Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes.
We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
- VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models [40.87249469370042]
Vision-language reward models (VLRMs) have become increasingly pivotal in the reasoning process.
Existing benchmarks for VLRMs typically assess only a single aspect of their capabilities.
We propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions.
arXiv Detail & Related papers (2025-03-10T15:52:57Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT).
VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks.
We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data [35.229595049396245]
We propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs.
Our approach begins by synthesizing interpretable answers that include human-verifiable visual features.
After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers.
arXiv Detail & Related papers (2025-02-19T19:05:45Z)
- Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task [3.2228025627337864]
Advancing machine visual reasoning requires a deeper understanding of how Vision-Language Models (VLMs) process and interpret complex visual patterns.
This work introduces a novel, cognitively-inspired evaluation framework to systematically analyze VLM reasoning on natural image-based Bongard Problems.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends [67.43992456058541]
Image restoration (IR) refers to the process of improving visual quality of images while removing degradation, such as noise, blur, weather effects, and so on.
Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions.
The all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types.
arXiv Detail & Related papers (2024-10-19T11:11:09Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM)
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM families (e.g., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z)
- Counterfactual Edits for Generative Evaluation [0.0]
We propose a framework for the evaluation and explanation of synthesized results based on concepts instead of pixels.
Our framework exploits knowledge-based counterfactual edits that underline which objects or attributes should be inserted, removed, or replaced from generated images.
Global explanations produced by accumulating local edits can also reveal what concepts a model cannot generate in total.
arXiv Detail & Related papers (2023-03-02T20:10:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.