Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
- URL: http://arxiv.org/abs/2507.10013v1
- Date: Mon, 14 Jul 2025 07:48:54 GMT
- Title: Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
- Authors: Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
- Abstract summary: We re-evaluate the bouba-kiki effect, where humans reliably associate pseudowords like "bouba" with round shapes and "kiki" with jagged ones. We show that vision-and-language models (VLMs) do not consistently exhibit the bouba-kiki effect.
- Score: 0.10923877073891446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like "bouba" with round shapes and "kiki" with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses output probabilities as a measure of model preference, and Grad-CAM as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
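To make the prompt-based setup concrete, below is a minimal sketch of how such an evaluation could look for the ViT variant of CLIP via HuggingFace Transformers. The checkpoint name, the shape image filenames, and the exact prompt wording are illustrative assumptions, not the authors' released code or stimuli.

```python
# Minimal sketch of a prompt-based bouba-kiki evaluation with a CLIP variant.
# Assumptions (not from the paper): the "openai/clip-vit-base-patch32" checkpoint,
# local stimulus images "round_shape.png" / "jagged_shape.png", and simple
# pseudoword prompts modelled after human bouba-kiki experiments.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

prompts = ['an image of a shape called "bouba"',
           'an image of a shape called "kiki"']
images = [Image.open("round_shape.png"), Image.open("jagged_shape.png")]

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); a softmax over the text
# dimension gives the model's preference for labelling each shape "bouba" vs "kiki".
probs = outputs.logits_per_image.softmax(dim=-1)
for name, row in zip(["round", "jagged"], probs):
    print(f"{name}: P(bouba)={row[0].item():.3f}  P(kiki)={row[1].item():.3f}")
```

A bouba-kiki-consistent model would assign higher probability to "bouba" for the round shape and to "kiki" for the jagged one; the paper reports that this pattern does not hold consistently across the evaluated CLIP variants.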
Related papers
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions [0.03495246564946555]
We introduce a novel task called Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate the performance of state-of-the-art multimodal models in recognizing and interpreting visual illusions.
arXiv Detail & Related papers (2024-12-11T07:51:18Z)
- Cross-Modal Consistency in Multimodal Large Language Models [33.229271701817616]
We introduce a novel concept termed cross-modal consistency.
Our experimental findings reveal a pronounced inconsistency between the vision and language modalities within GPT-4V.
Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.
arXiv Detail & Related papers (2024-11-14T08:22:42Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models toward generating images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Are Visual-Language Models Effective in Action Recognition? A Comparative Study [22.97135293252601]
This paper provides a large-scale study of, and insights into, state-of-the-art vision foundation models.
It compares their transferability to zero-shot and frame-wise action recognition tasks.
Experiments are conducted on recent fine-grained, human-centric action recognition datasets.
arXiv Detail & Related papers (2024-10-22T16:28:21Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models [56.89974470863207]
This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts.
We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models' latent space.
arXiv Detail & Related papers (2024-08-17T01:43:51Z)
- What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models [0.10923877073891446]
Cross-modal preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings.
We probe and compare four vision-and-language models (VLMs) for a well-known human cross-modal preference, the bouba-kiki effect.
Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
arXiv Detail & Related papers (2024-07-25T12:09:41Z)
- From CNNs to Transformers in Multimodal Human Action Recognition: A Survey [23.674123304219822]
Human action recognition is one of the most widely studied research problems in Computer Vision.
Recent studies have shown that addressing it using multimodal data leads to superior performance.
The recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task.
arXiv Detail & Related papers (2024-05-22T02:11:18Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects [75.15563723169234]
We investigate how vision models adaptively use context for out-of-distribution generalization.
We show that models that excel in one setting tend to struggle in the other.
To replicate the generalization abilities of biological vision, computer vision models must have factorized object vs. background representations.
arXiv Detail & Related papers (2023-06-09T15:29:54Z)