JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
- URL: http://arxiv.org/abs/2409.12953v3
- Date: Wed, 25 Sep 2024 01:46:10 GMT
- Title: JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
- Authors: Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas,
- Abstract summary: We release JourneyBench, a benchmark of generated images to assess the model's fine-grained multimodal reasoning abilities.
Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios.
Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
- Score: 72.42826916932519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Related papers
- TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models [52.21298691935726]
The ability to reason over time series is a fundamental skill for generalist models to solve practical problems.<n>To bridge this gap, we introduce TSRBench, a comprehensive benchmark designed to stress-test the full spectrum of time series reasoning capabilities.
arXiv Detail & Related papers (2026-01-26T18:04:54Z) - Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT.<n>Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs.<n>In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern.<n>In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z) - VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors [32.4515119002324]
VisChainBench is a benchmark designed to rigorously evaluate Large Vision-Language Models (LVLMs)<n>It contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting)<n>Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias.
arXiv Detail & Related papers (2025-12-07T09:48:10Z) - MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning? [21.777853590188688]
We present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input.<n> Experiments on state-of-the-art Vision-Language Models reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty.
arXiv Detail & Related papers (2025-11-28T11:55:05Z) - When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning.<n>We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z) - TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning [30.018325742295243]
OpenAI o3 can create and operate tools to transform images for problem-solving, also known as thinking-textitwith-images in chain-of-thought.<n>Visual Search tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning.<n>We introduce textbfTIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks.
arXiv Detail & Related papers (2025-11-03T18:40:17Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks.<n>Instead of relying on external knowledge, our tasks require models to reason from visual content alone.<n>Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images.<n>We adapt rule-based reinforcement learning for Vision-Language Models (VLMs)<n>Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge [45.20691825097646]
We introduce VisualPuzzles, a benchmark that targets visual reasoning.
VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning.
arXiv Detail & Related papers (2025-04-14T15:50:39Z) - TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images [1.8668361563848481]
TDBench is a comprehensive benchmark for Vision-Language Models (VLMs) in top-down image understanding.
It consists of visual question-answer pairs across ten evaluation dimensions of image understanding.
We conduct four case studies that commonly happen in real-world scenarios but are less explored.
arXiv Detail & Related papers (2025-04-01T19:01:13Z) - The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles [29.214813685163218]
Release of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models.<n>We track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles.<n>Our results reveal that o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong scalability in multimodal reasoning.
arXiv Detail & Related papers (2025-02-03T05:47:04Z) - MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents.
Existing benchmarks mostly assess their reasoning abilities in language part.
MageBench is a reasoning capability oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z) - Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images [19.923665989164387]
We propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge Large Language Models.
Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues.
Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped.
arXiv Detail & Related papers (2024-08-15T12:04:32Z) - Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs were shown to approach the GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z) - NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems.<n>Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints.<n>Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - REBUS: A Robust Evaluation Benchmark of Understanding Symbols [1.90463290938268]
GPT-4o significantly outperforms all other models, followed by proprietary models outperforming all other evaluated models.
Even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles.
Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
arXiv Detail & Related papers (2024-01-11T00:30:28Z) - Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study $textitgenerative VLMs$ that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z) - Scene Graph as Pivoting: Inference-time Image-free Unsupervised
Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU scores on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z) - Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.