VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
- URL: http://arxiv.org/abs/2407.03000v2
- Date: Thu, 10 Oct 2024 06:19:33 GMT
- Title: VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
- Authors: Zhe Hu, Yixiao Ren, Jing Li, Yu Yin
- Abstract summary: Large vision language models (VLMs) have demonstrated significant potential for integration into daily life.
This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues.
- Score: 14.094823787048592
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large vision language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large VLMs focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions in vision-depicted situations. VIVA contains 1,240 images depicting diverse real-world situations, each paired with manually annotated decisions grounded in it. Given an image, the model should select the most appropriate action to address the situation and provide the relevant human values and reasoning underlying the decision. Extensive experiments based on VIVA show the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.
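To make the task format concrete, below is a minimal, hypothetical sketch of how a VIVA-style item and its action-selection scoring could be represented. The field names and the `query_vlm` callable are illustrative assumptions, not the released VIVA data format or evaluation code.

```python
# Minimal, hypothetical sketch of a VIVA-style evaluation loop.
# Field names (image_path, candidate_actions, gold_action, gold_values)
# and the query_vlm() helper are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SituationItem:
    image_path: str                 # image depicting a real-world situation
    candidate_actions: List[str]    # candidate actions, one annotated as most appropriate
    gold_action: int                # index of the annotated best action
    gold_values: List[str] = field(default_factory=list)  # human values behind the decision


def action_selection_accuracy(
    items: List[SituationItem],
    query_vlm: Callable[[str, List[str]], int],
) -> float:
    """Fraction of items where the model selects the annotated action."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        predicted = query_vlm(item.image_path, item.candidate_actions)
        if predicted == item.gold_action:
            correct += 1
    return correct / len(items)
```

In a full evaluation, value identification and reason generation would be scored separately (e.g., by matching predicted values against `gold_values`), but the action-selection accuracy above captures the core decision-making step described in the abstract.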
Related papers
- VisualActBench: Can VLMs See and Act like a Human? [47.16421650715271]
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments.
However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored.
We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions.
arXiv Detail & Related papers (2025-12-10T18:36:18Z) - Task-Aware Resolution Optimization for Visual Large Language Models [57.85959322093884]
Most visual large language models (VLLMs) pre-assume a fixed resolution for downstream tasks, which leads to subpar performance.
We propose an empirical formula to determine the optimal resolution for a given vision-language task, combining these two factors.
Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution.
arXiv Detail & Related papers (2025-10-10T19:53:30Z) - VIVA+: Human-Centered Situational Decision-Making [9.67738226553979]
We introduce VIVA+, a benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations.
VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making.
We evaluate the latest commercial and open-source models on VIVA+, revealing distinct performance patterns and highlighting significant challenges.
arXiv Detail & Related papers (2025-09-28T07:13:11Z) - Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values [36.47201247038004]
We show that Large Language Models (LLMs) exhibit stark deviations from human values in prioritizing various attributes.
We further show that low-rank supervised fine-tuning with few samples is often effective at improving decision consistency and calibrating indecision modeling.
arXiv Detail & Related papers (2025-05-30T01:23:11Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment [23.7081830844157]
This study examines the alignment between socio-driven decisions and human judgment in various contexts of the moral machine experiment.
We find that the moral decisions of LLMs vary substantially by persona, showing greater shifts in moral decisions for critical tasks than humans.
We discuss the ethical implications and risks associated with deploying these models in applications that involve moral decisions.
arXiv Detail & Related papers (2025-04-15T05:29:51Z) - Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning [23.7096338281261]
This paper shows that Vision Language Models can achieve surprisingly strong decision-making performance when visual scenes are represented as text-only descriptions.
We propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making.
arXiv Detail & Related papers (2025-03-21T09:25:23Z) - Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts [39.72461455275383]
We introduce Value-Spectrum, a benchmark aimed at assessing Vision-Language Models (VLMs) based on Schwartz's value dimensions.
We constructed a vectorized database of over 50,000 short videos sourced from TikTok, YouTube Shorts, and Instagram Reels, covering multiple months and a wide array of topics.
We also developed a VLM agent pipeline to automate video browsing and analysis.
arXiv Detail & Related papers (2024-11-18T11:31:10Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions [69.9980759344628]
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions.
We introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities.
We present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies.
arXiv Detail & Related papers (2024-06-27T15:01:42Z) - WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo (a minimal rank-correlation sketch follows after this list).
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z) - Decision Theoretic Foundations for Experiments Evaluating Human Decisions [18.27590643693167]
We argue that to attribute loss in human performance to forms of bias, an experiment must provide participants with the information that a rational agent would need to identify the utility-maximizing decision.
As a demonstration, we evaluate the extent to which recent evaluations of decision-making from the literature on AI-assisted decisions achieve these criteria.
arXiv Detail & Related papers (2024-01-25T16:21:37Z) - EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models [21.410065053609877]
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks.
EgoThink is a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions.
arXiv Detail & Related papers (2023-11-27T07:44:25Z) - From Values to Opinions: Predicting Human Behaviors and Stances Using Value-Injected Large Language Models [10.520548925719565]
We propose to use value-injected large language models (LLMs) to predict opinions and behaviors.
We conduct a series of experiments on four tasks to test the effectiveness of this value-injection method (VIM).
Results suggest that opinions and behaviors can be better predicted using value-injected LLMs than the baseline approaches.
arXiv Detail & Related papers (2023-10-27T02:18:10Z) - Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
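Regarding the rank-agreement figure quoted in the WildVision entry above, the following is a minimal, self-contained sketch of how a Spearman correlation between two model rankings (e.g., a judge-based benchmark score versus arena Elo) can be computed. The model names and scores are made up for illustration and are not WildVision data.

```python
# Hypothetical sketch: Spearman rank correlation between two model rankings.
# Scores and ratings below are invented for illustration only.
from scipy.stats import spearmanr

# Per-model scores under two evaluation schemes (illustrative values).
judge_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.1, "model_d": 49.7}
elo_ratings = {"model_a": 1210, "model_b": 1185, "model_c": 1120, "model_d": 1015}

models = sorted(judge_scores)
rho, p_value = spearmanr(
    [judge_scores[m] for m in models],
    [elo_ratings[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # rho near 1 means the two rankings agree
```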