ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
- URL: http://arxiv.org/abs/2507.23135v1
- Date: Wed, 30 Jul 2025 22:30:48 GMT
- Title: ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
- Authors: Ananya Sadana, Yash Kumar Lal, Jiawei Zhou
- Abstract summary: We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Evaluation results on ten frontier vision-language models show underwhelming performance.
- Score: 10.026145953509246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, and the goal is to decide whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), far behind human performance (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
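To make the task format concrete, below is a minimal Python sketch of what an ISO-Bench-style example and its before/after evaluation could look like. The field names, prompt wording, and `predict` interface are illustrative assumptions, not the authors' released code; only the task definition (deciding whether an image step occurs before or after a referenced plan step) and the F1 metric come from the abstract.

```python
# Minimal sketch of an ISO-Bench-style example and its evaluation.
# Field names, the prompt wording, and `predict` are hypothetical; the paper
# only specifies the task (before/after ordering) and reports F1.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OrderingExample:
    image_path: str  # image showing one step of a procedural task
    text_step: str   # a step quoted from the written plan
    label: str       # gold answer: "before" or "after"


PROMPT = (
    "You are shown an image of one step in a procedure.\n"
    'Plan step: "{step}"\n'
    "Does the step shown in the image happen before or after this plan step? "
    "Answer with exactly one word: before or after."
)


def evaluate(examples: List[OrderingExample],
             predict: Callable[[str, str], str]) -> float:
    """Compute F1 for the 'before' class given predict(image_path, prompt) -> str."""
    tp = fp = fn = 0
    for ex in examples:
        pred = predict(ex.image_path, PROMPT.format(step=ex.text_step)).strip().lower()
        if pred == "before" and ex.label == "before":
            tp += 1
        elif pred == "before" and ex.label == "after":
            fp += 1
        elif pred == "after" and ex.label == "before":
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Plugging a vision-language model client into `predict` would approximate the zero-shot protocol in spirit; the chain-of-thought setting reported in the abstract would presumably prepend a reasoning instruction before the final one-word answer.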
Related papers
- Dynamic Multimodal Prototype Learning in Vision-Language Models [44.84161970425967]
We introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt vision-language models at test time. By viewing the prototype as a discrete distribution over textual descriptions and visual particles, ProtoMM can combine multimodal features for comprehensive prototype learning.
arXiv Detail & Related papers (2025-07-04T15:31:47Z) - R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [26.757458496178437]
We introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. We construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL.
arXiv Detail & Related papers (2025-03-13T17:56:05Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z) - LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [34.32147663809707]
We introduce LLaVA-CoT, a large Vision-Language Model (VLM) designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. With only 100k training samples and test-time scaling, LLaVA-CoT outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks.
arXiv Detail & Related papers (2024-11-15T18:58:31Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images [72.42826916932519]
We release JourneyBench, a benchmark of generated images for assessing models' fine-grained multimodal reasoning abilities. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models.
arXiv Detail & Related papers (2024-09-19T17:58:16Z) - A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z) - Argumentative Stance Prediction: An Exploratory Study on Multimodality and Few-Shot Learning [0.0]
We evaluate the necessity of images for stance prediction in tweets.
Our work suggests using an ensemble of fine-tuned text-based language models.
Our findings suggest that the multimodal models tend to perform better when image content is summarized as natural language.
arXiv Detail & Related papers (2023-10-11T00:18:29Z) - Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, a lightweight, transformer-based fusion module that pairs frozen visual representations with language concepts.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z) - Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model can be learned with a prefix language modeling objective on text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)