Visual Persuasion: What Influences Decisions of Vision-Language Models?
- URL: http://arxiv.org/abs/2602.15278v1
- Date: Tue, 17 Feb 2026 00:33:53 GMT
- Title: Visual Persuasion: What Influences Decisions of Vision-Language Models?
- Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
- Abstract summary: We study the structure of visual preferences of vision-language models (VLMs). Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection.
- Score: 27.526872154566316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
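The revealed-preference idea in the abstract (inferring a latent utility from head-to-head choices between edited images) can be illustrated with a small sketch. This is not the authors' pipeline: the Bradley-Terry model, the function names `fit_bradley_terry` and `choice_probability`, and the toy win counts below are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate a latent utility per image from pairwise choice counts.

    wins[(i, j)] is the number of times image i was chosen over image j.
    Uses the standard minorization-maximization update for Bradley-Terry.
    Returns log-strengths (utilities) centered so that they sum to zero.
    """
    total_wins = [0.0] * n_items
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    for (i, j), count in wins.items():
        total_wins[i] += count
        pair_counts[(min(i, j), max(i, j))] += count

    strength = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            share = n / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        # Floor keeps the logarithm defined if an image never wins.
        strength = [
            max(total_wins[k] / denom[k], 1e-12) if denom[k] > 0 else strength[k]
            for k in range(n_items)
        ]
        # Fix the scale: normalize the geometric mean to 1.
        log_mean = sum(math.log(s) for s in strength) / n_items
        strength = [s / math.exp(log_mean) for s in strength]
    return [math.log(s) for s in strength]


def choice_probability(utility_a, utility_b):
    """Probability that A is chosen over B under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(utility_b - utility_a))


# Hypothetical data: 0 = original photo, 1 = edit A, 2 = edit B.
wins = {
    (1, 0): 70, (0, 1): 30,
    (2, 0): 55, (0, 2): 45,
    (1, 2): 60, (2, 1): 40,
}
utilities = fit_bradley_terry(wins, n_items=3)
p_edit_a_over_original = choice_probability(utilities[1], utilities[0])
```

In this toy setup, an edit that raises the fitted utility relative to the original is one that the model selects more often in head-to-head comparisons, which is the quantity the abstract describes as selection probability.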
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection [26.020338338880034]
We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes.
arXiv Detail & Related papers (2025-12-15T12:04:26Z) - Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification [22.871255950998016]
We introduce a novel framework for inference-time visual tokens scaling that enables MLLMs to perform verifier-guided reasoning over visual content. Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
arXiv Detail & Related papers (2025-06-08T17:38:49Z) - ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm. Latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z) - Exploring the Design Space of Visual Context Representation in Video MLLMs [102.11582556690388]
Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks.
Visual context representation refers to the scheme to select frames from a video and further select the tokens from a frame.
In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes.
arXiv Detail & Related papers (2024-10-17T15:59:52Z) - Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z) - Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations [35.458709912618176]
Deep learning has led to huge progress in complex image classification tasks like ImageNet, but unexpected failure modes, e.g. via spurious features, persist.
For safety-critical tasks the black-box nature of their decisions is problematic, and explanations or at least methods which make decisions plausible are needed urgently.
We address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation.
arXiv Detail & Related papers (2023-11-29T17:35:29Z) - Sparse Visual Counterfactual Explanations in Image Space [50.768119964318494]
We present a novel model for visual counterfactual explanations in image space.
We show that it can be used to detect undesired behavior of ImageNet classifiers due to spurious features in the ImageNet dataset.
arXiv Detail & Related papers (2022-05-16T20:23:11Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Look here! A parametric learning based approach to redirect visual attention [49.609412873346386]
We introduce an automatic method to make an image region more attention-capturing via subtle image edits.
Our model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions.
Our edits enable inference at interactive rates on any image size, and easily generalize to videos.
arXiv Detail & Related papers (2020-08-12T16:08:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.