What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models
- URL: http://arxiv.org/abs/2407.17974v1
- Date: Thu, 25 Jul 2024 12:09:41 GMT
- Authors: Tessa Verhoef, Kiana Shahrasbi, Tom Kouwenhoven
- Abstract summary: Cross-modal preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings.
We probe and compare four vision-and-language models (VLMs) for a well-known human cross-modal preference, the bouba-kiki effect.
Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision-and-language models (VLMs), it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
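To make the probing setup concrete, the sketch below shows one way such an image-text matching probe could be run with CLIP via the HuggingFace transformers library. This is an illustrative assumption, not the authors' released code: the model checkpoint, prompt wording, and shape image filenames are placeholders standing in for round and jagged novel-shape stimuli.

```python
# Minimal sketch of a CLIP-based bouba-kiki probe (an assumed setup for
# illustration, not the paper's published code).
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical stimulus files: one rounded and one jagged novel shape.
images = [Image.open("round_shape.png"), Image.open("spiky_shape.png")]
texts = ["a shape called bouba", "a shape called kiki"]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[i, j] is the similarity of image i to text j.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
# A human-like bouba-kiki effect would pair the round shape with "bouba"
# (high probs[0, 0]) and the jagged shape with "kiki" (high probs[1, 1]).
```

Aggregating such pairwise choices over many pseudowords and shapes, and comparing against chance, is one plausible way to quantify whether a model shows the effect.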
Related papers
- Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
We re-evaluate the bouba-kiki effect, where humans reliably associate pseudowords like "bouba" with round shapes and "kiki" with jagged ones.
We show that vision-and-language models (VLMs) do not consistently exhibit the bouba-kiki effect.
arXiv Detail & Related papers (2025-07-14T07:48:54Z) - Analyzing The Language of Visual Tokens
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z) - Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts that steer diffusion models to generate images depicting specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models
We show that Vision Language Models (VLMs) can implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone.
We perform experiments including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks.
Our results show that VLMs demonstrate varying levels of agreement with human labels, and that VLMs may require more task information than their human counterparts for in silico experimentation.
arXiv Detail & Related papers (2024-09-23T11:13:25Z) - Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models
We investigate whether integration of visuo-linguistic information leads to representations that are more aligned with human brain activity.
Our findings indicate an advantage of multimodal models in predicting human brain activations.
arXiv Detail & Related papers (2024-07-25T10:08:37Z) - How Well Do Deep Learning Models Capture Human Concepts? The Case of the Typicality Effect
Recent research looking for human-like typicality effects in language and vision models has focused on models of a single modality.
This study expands this behavioral evaluation of models by considering a broader range of language and vision models.
It also evaluates whether the combined typicality predictions of vision + language model pairs, as well as a multimodal CLIP-based model, are better aligned with human typicality judgments than those of models of either modality alone.
arXiv Detail & Related papers (2024-05-25T08:38:30Z) - SpeechAlign: Aligning Speech Generation to Human Preferences
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z) - MulCogBench: A Multi-modal Cognitive Benchmark Dataset for Evaluating Chinese and English Computational Language Models
This paper proposes MulCogBench, a cognitive benchmark dataset collected from native Chinese and English participants.
It encompasses a variety of cognitive data, including subjective semantic ratings, eye-tracking, functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG).
Results show that language models share significant similarities with human cognitive data and the similarity patterns are modulated by the data modality and stimuli complexity.
arXiv Detail & Related papers (2024-03-02T07:49:57Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
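The last entry above aligns visual and language representations with cross-modal contrastive learning on MS COCO. As a rough sketch of that kind of alignment objective (a standard symmetric InfoNCE formulation assumed for illustration, not the cited paper's exact implementation), the loss over paired image and caption embeddings could look like this:

```python
# Sketch of a symmetric cross-modal contrastive (InfoNCE) loss, as commonly used
# to align image and caption embeddings; an illustrative assumption, not the
# cited paper's actual code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of paired images/captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; push them above mismatched pairs in
    # both the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random embeddings standing in for encoder outputs:
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Training the two encoder streams with an objective of this shape is what lets the language stream later embed concepts in a visually grounded semantic space.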