VIVA+: Human-Centered Situational Decision-Making
- URL: http://arxiv.org/abs/2509.23698v1
- Date: Sun, 28 Sep 2025 07:13:11 GMT
- Title: VIVA+: Human-Centered Situational Decision-Making
- Authors: Zhe Hu, Yixiao Ren, Guanzhong Liu, Jing Li, Yu Yin,
- Abstract summary: We introduce VIVA+, a benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges.
- Score: 9.67738226553979
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model's ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.
Related papers
- Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
- How Good are Foundation Models in Step-by-Step Embodied Reasoning? [79.15268080287505]
Embodied agents must make decisions that are safe, spatially coherent, and grounded in context. Recent advances in large multimodal models have shown promising capabilities in visual understanding and language generation. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments.
arXiv Detail & Related papers (2025-09-18T17:56:30Z)
- Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality [45.34252727738116]
Mean Opinion Score (MOS) is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
arXiv Detail & Related papers (2025-05-26T08:52:02Z)
- Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities [5.0778942095543576]
This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of Large Language Models. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment.
arXiv Detail & Related papers (2025-05-19T14:50:44Z)
- Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.77467538102924]
Recent advancements in Large Reasoning Models (LRMs) have demonstrated remarkable performance in specialized reasoning tasks. We show that acquiring deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs. We demonstrate that adaptive reasoning -- employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks.
arXiv Detail & Related papers (2025-03-23T08:18:51Z)
- Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning [16.938301925105097]
This paper shows that Vision Language Models can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions. We propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making.
arXiv Detail & Related papers (2025-03-21T09:25:23Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values [14.094823787048592]
Large vision language models (VLMs) have demonstrated significant potential for integration into daily life.
This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues.
arXiv Detail & Related papers (2024-07-03T10:59:06Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning [76.3114831562989]
Strategic reasoning requires Large Language Model (LLM) agents to adapt their strategies dynamically in multi-agent environments.
We propose a novel framework: "K-Level Reasoning with Large Language Models (K-R)".
arXiv Detail & Related papers (2024-02-02T16:07:05Z)