Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation
- URL: http://arxiv.org/abs/2508.10367v2
- Date: Tue, 14 Oct 2025 07:55:45 GMT
- Title: Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation
- Authors: Pablo Hernández-Cámara, Alexandra Gomez-Villa, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
- Abstract summary: We introduce a behavioural method for estimating the Contrast Sensitivity Function (CSF) in Multimodal Large Language Models (MLLMs). Models are queried with structured prompts while viewing noise-based stimuli filtered at specific spatial frequencies. Our results reveal that some models resemble human CSFs in shape or scale, but none capture both.
- Score: 37.9406446788251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how Multimodal Large Language Models (MLLMs) process low-level visual features is critical for evaluating their perceptual abilities and has not been systematically characterized. Inspired by human psychophysics, we introduce a behavioural method for estimating the Contrast Sensitivity Function (CSF) in MLLMs by treating them as end-to-end observers. Models are queried with structured prompts while viewing noise-based stimuli filtered at specific spatial frequencies. Psychometric functions are derived from the binary verbal responses, and contrast thresholds (and CSFs) are obtained without relying on internal activations or classifier-based proxies. Our results reveal that some models resemble human CSFs in shape or scale, but none capture both. We also find that CSF estimates are highly sensitive to prompt phrasing, indicating limited linguistic robustness. Finally, we show that CSFs predict model performance under frequency-filtered and adversarial conditions. These findings highlight systematic differences in frequency tuning across MLLMs and establish CSF estimation as a scalable diagnostic tool for multimodal perception.
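The abstract already spells out the measurement loop: show the model noise filtered at a target spatial frequency and contrast, ask a yes/no detection question, fit a psychometric function to the binary answers, and read off the contrast threshold (sensitivity is its reciprocal). The sketch below is one way that loop could be wired up; it is not the authors' code. The stimulus generator, the logistic psychometric form, and the `ask_model` callable (a stand-in for prompting an MLLM and parsing its yes/no reply into a boolean) are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def bandpass_noise(size, freq_cpd, contrast, pixels_per_degree=32.0, bandwidth=0.5, seed=0):
    """White noise band-pass filtered around freq_cpd (cycles/degree), scaled to a given RMS contrast."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    f = np.fft.fftfreq(size) * pixels_per_degree              # cycles per degree along each axis
    radius = np.sqrt(f[None, :] ** 2 + f[:, None] ** 2)
    mask = np.exp(-0.5 * ((radius - freq_cpd) / (bandwidth * freq_cpd)) ** 2)
    filtered = np.real(np.fft.ifft2(np.fft.fft2(noise) * mask))
    filtered = filtered / (filtered.std() + 1e-12) * contrast  # set RMS contrast
    return np.clip(0.5 + filtered, 0.0, 1.0)                   # mean-gray image in [0, 1]

def psychometric(c, threshold, slope, guess=0.02, lapse=0.02):
    """Logistic psychometric function over log-contrast for a yes/no detection task."""
    p = 1.0 / (1.0 + np.exp(-slope * (np.log10(c) - np.log10(threshold))))
    return guess + (1.0 - guess - lapse) * p

def estimate_csf(ask_model, freqs_cpd, contrasts, n_trials=20, **stim_kw):
    """Fit a contrast threshold per spatial frequency; contrast sensitivity is 1 / threshold."""
    sensitivity = []
    for f in freqs_cpd:
        detect_rate = [
            sum(ask_model(bandpass_noise(256, f, c, seed=t, **stim_kw)) for t in range(n_trials)) / n_trials
            for c in contrasts
        ]
        (thr, _slope), _ = curve_fit(
            psychometric, contrasts, detect_rate,
            p0=[np.median(contrasts), 3.0],
            bounds=([min(contrasts), 0.1], [max(contrasts), 20.0]),
        )
        sensitivity.append(1.0 / thr)
    return np.array(sensitivity)
```

Calling, say, `estimate_csf(ask_model, freqs_cpd=[0.5, 1, 2, 4, 8], contrasts=np.logspace(-3, -0.3, 8))` and plotting sensitivity against frequency gives a CSF curve that can then be compared, in shape and scale, with human contrast sensitivity data.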
Related papers
- Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis [63.56564189749175]
We analyze speech embeddings from structured clinical interviews of psychotic patients and healthy controls. Lyapunov exponent (LE) spectra are computed from word-level and answer-level embeddings.
arXiv Detail & Related papers (2026-02-18T08:46:46Z) - E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis [54.763420895859035]
We present E^2-LLM (EEG-to-Emotion Large Language Model), the first MLLM framework for interpretable emotion analysis from EEG. E^2-LLM integrates a pretrained EEG encoder with Q-based LLMs through learnable projection layers, employing a multi-stage training pipeline. Experiments on the dataset across seven emotion categories demonstrate that E^2-LLM achieves strong performance on emotion classification.
arXiv Detail & Related papers (2026-01-11T13:21:20Z) - FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering [14.550872089352943]
FaithSCAN is a lightweight network that detects hallucinations by exploiting rich internal signals of vision-language models. We extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding.
arXiv Detail & Related papers (2026-01-01T09:19:39Z) - Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models [43.46006663176283]
This work systematically analyzes the impact of various perturbations on medical MLLMs. For the visual modality, we propose Perturbation-aware Denoising (PDC), which leverages the MLLM's own vision encoder to identify noise patterns. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLM's self-assessment capabilities to refine noisy text.
arXiv Detail & Related papers (2025-12-26T10:23:30Z) - Suppressing VLM Hallucinations with Spectral Representation Filtering [49.52703800684483]
Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method that suppresses such hallucinations by analyzing and correcting the covariance structure of the model's representations. (A generic, illustrative sketch of this kind of covariance-spectrum filtering appears after the related-papers list below.)
arXiv Detail & Related papers (2025-11-15T13:49:27Z) - Benchmarking and Mitigate Sycophancy in Medical Vision-Language Models [21.353225217216252]
Vision-language models often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel clinically grounded benchmark.
arXiv Detail & Related papers (2025-09-26T07:02:22Z) - Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection [1.660734109310745]
We propose a novel framework that combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Our results suggest a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP.
arXiv Detail & Related papers (2025-09-22T00:18:58Z) - SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning [50.69448058071441]
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. We propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses. We show that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance.
arXiv Detail & Related papers (2025-08-14T03:01:05Z) - ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations. We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update. We then propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z) - Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks. We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks. We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances the model's self-awareness in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z) - Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations [44.83933994734478]
Multimodal large language models (MLLMs) have demonstrated remarkable performance in visual tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features when detecting objects.
arXiv Detail & Related papers (2025-03-19T04:39:45Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - UniAR: A Unified model for predicting human Attention and Responses on visual content [12.281060227170792]
We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content.
We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks.
Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements.
arXiv Detail & Related papers (2023-12-15T19:57:07Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one determines the next point of focus (the fixation), while the other interprets the visuals surrounding that fixation.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
arXiv Detail & Related papers (2022-06-15T03:44:42Z) - Multimodal perception for dexterous manipulation [14.314776558032166]
We propose a cross-modal sensory data generation framework for the translation between vision and touch.
We propose a spatio-temporal attention model for tactile texture recognition, which takes both spatial features and the time dimension into consideration.
arXiv Detail & Related papers (2021-12-28T21:20:26Z) - A Psychophysically Oriented Saliency Map Prediction Model [4.884688557957589]
We propose a new psychophysical saliency prediction architecture, WECSF, inspired by the multi-channel model of visual cortex functioning in humans.
The proposed model is evaluated using several datasets, including the MIT1003, MIT300, Toronto, SID4VAM, and UCF Sports datasets.
Our model achieves stable and superior performance across different metrics on natural images, psychophysical synthetic images, and dynamic videos.
arXiv Detail & Related papers (2020-11-08T20:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.