Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- URL: http://arxiv.org/abs/2504.02826v2
- Date: Tue, 08 Apr 2025 16:43:44 GMT
- Title: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- Authors: Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
- Abstract summary: We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach.
- Score: 90.65399476233495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
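To make the LMM-as-a-judge protocol concrete, below is a minimal, hypothetical sketch of how an edited image could be scored along the three dimensions named in the abstract (Instruction Reasoning, Appearance Consistency, Visual Plausibility). The prompt wording, the 1-5 scales, the `gpt-4o` judge model, and the `judge_edit` helper are illustrative assumptions, not the benchmark's actual implementation; the released RISEBench code defines the real rubric and prompts.

```python
# Hypothetical LMM-as-a-judge sketch for reasoning-informed visual editing.
# Prompt text, score scales, and model choice are assumptions for illustration only.
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def judge_edit(instruction: str, input_image: str, edited_image: str) -> dict:
    """Ask a multimodal judge to score an edit (1-5) on instruction reasoning,
    appearance consistency, and visual plausibility, returned as JSON."""
    prompt = (
        "You are evaluating a reasoning-informed image edit.\n"
        f"Editing instruction: {instruction}\n"
        "The first image is the input; the second is the edited result.\n"
        "Give integer scores from 1 to 5 for: instruction_reasoning (does the edit "
        "reflect the reasoning the instruction requires?), appearance_consistency "
        "(are unrelated regions preserved?), and visual_plausibility (is the result "
        "free of artifacts?).\n"
        'Reply with JSON only, e.g. {"instruction_reasoning": 4, '
        '"appearance_consistency": 5, "visual_plausibility": 3}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable multimodal judge could be substituted here
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(input_image)}},
                {"type": "image_url", "image_url": {"url": encode_image(edited_image)}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)


# Example usage (file paths and instruction are placeholders):
# scores = judge_edit("Show this ice cube after 10 minutes in the sun",
#                     "case_001_input.png", "case_001_edited.png")
# print(scores)
```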
Related papers
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies.
VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning.
We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy.
arXiv Detail & Related papers (2025-04-24T06:16:38Z)
- V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs.
We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs.
Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes.
We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
- VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models [40.87249469370042]
Vision-language reward models (VLRMs) have become increasingly pivotal in the reasoning process.
Existing benchmarks for VLRMs typically assess only a single aspect of their capabilities.
We propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions.
arXiv Detail & Related papers (2025-03-10T15:52:57Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT).
VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks.
We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task [3.2228025627337864]
Advancing machine visual reasoning requires a deeper understanding of how Vision-Language Models (VLMs) process and interpret complex visual patterns.
This work introduces a novel, cognitively-inspired evaluation framework to systematically analyze VLM reasoning on natural image-based Bongard Problems.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends [67.43992456058541]
Image restoration (IR) refers to the process of improving the visual quality of images while removing degradations such as noise, blur, and weather effects.
Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions.
The all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types.
arXiv Detail & Related papers (2024-10-19T11:11:09Z)
- Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts and fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z)