Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
- URL: http://arxiv.org/abs/2310.16781v3
- Date: Tue, 2 Apr 2024 05:50:21 GMT
- Title: Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
- Authors: Morris Alper, Hadar Averbuch-Elor
- Abstract summary: We show that sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion.
Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools.
- Score: 13.300199242824934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regards to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.
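The zero-shot probing described in the abstract amounts to comparing a pseudoword's text embedding against embeddings of prompts for sharp versus round visual concepts. Below is a minimal sketch of that comparison logic; the function name `sharpness_score` is an illustrative assumption, and synthetic random vectors stand in for real CLIP embeddings, so this is not the paper's actual implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sharpness_score(word_emb, spiky_emb, round_emb):
    """Relative preference of a pseudoword embedding for the 'spiky'
    prompt over the 'round' prompt; positive means kiki-like."""
    return cosine(word_emb, spiky_emb) - cosine(word_emb, round_emb)

# Synthetic stand-ins for CLIP text embeddings (illustration only);
# in practice these would come from a CLIP text encoder.
rng = np.random.default_rng(0)
spiky = rng.normal(size=512)                # embedding of a "spiky shape" prompt
rnd = rng.normal(size=512)                  # embedding of a "round shape" prompt
kiki = spiky + 0.1 * rng.normal(size=512)   # pseudoword close to the spiky prompt
bouba = rnd + 0.1 * rng.normal(size=512)    # pseudoword close to the round prompt

assert sharpness_score(kiki, spiky, rnd) > 0    # kiki-like pseudoword prefers spiky
assert sharpness_score(bouba, spiky, rnd) < 0   # bouba-like pseudoword prefers round
```

With a real vision-and-language model, the same score could be computed between a pseudoword's embedding and embeddings of contrasting shape prompts or rendered shape images; the sign of the score then indicates which end of the kiki-bouba continuum the model associates with the pseudoword.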
Related papers
- The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect [4.491391835956324]
We estimate semantic distances implicitly used by humans for a set of carefully selected abstract and concrete nouns. We find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. Results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
arXiv Detail & Related papers (2025-05-21T15:57:58Z) - LLMs as a synthesis between symbolic and continuous approaches to language [5.333866030919832]
I argue that deep learning models for language represent a synthesis between the two traditions.
I review recent research in mechanistic interpretability that showcases how a substantial part of morphosyntactic knowledge is encoded in a near-discrete fashion in LLMs.
arXiv Detail & Related papers (2025-02-17T14:48:18Z) - With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models [16.583370726582356]
We show that Vision Language Models (VLMs) can implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone.
We perform experiments including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks.
Our results show that VLMs demonstrate varying levels of agreement with human labels, and that VLMs may require more task information than their human counterparts for in silico experimentation.
arXiv Detail & Related papers (2024-09-23T11:13:25Z) - Measuring Sound Symbolism in Audio-visual Models [21.876743976994614]
This study investigates whether pre-trained audio-visual models demonstrate associations between sounds and visual representations.
Our findings reveal connections to human language processing, providing insights into cognitive architectures and machine learning strategies.
arXiv Detail & Related papers (2024-09-18T20:33:54Z) - What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models [0.10923877073891446]
Cross-modal preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings.
We probe and compare four vision-and-language models (VLMs) for a well-known human cross-modal preference, the bouba-kiki effect.
Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
arXiv Detail & Related papers (2024-07-25T12:09:41Z) - What Drives the Use of Metaphorical Language? Negative Insights from Abstractness, Affect, Discourse Coherence and Contextualized Word Representations [13.622570558506265]
Given a specific discourse, which discourse properties trigger the use of metaphorical language, rather than using literal alternatives?
Many NLP approaches to metaphorical language rely on cognitive and (psycho-)linguistic insights and have successfully defined models of discourse coherence, abstractness and affect.
In this work, we build five simple models relying on established cognitive and linguistic properties to predict the use of a metaphorical vs. synonymous literal expression in context.
arXiv Detail & Related papers (2022-05-23T08:08:53Z) - Things not Written in Text: Exploring Spatial Commonsense from Visual Signals [77.46233234061758]
We investigate whether models with visual signals learn more spatial commonsense than text-based models.
We propose a benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions.
We find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models.
arXiv Detail & Related papers (2022-03-15T17:02:30Z) - Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models [0.7454831343436739]
We show that $n$-grams composed of random character sequences, or $garble$, provide a novel context for studying word meaning within and beyond extant language.
By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams.
arXiv Detail & Related papers (2022-03-15T13:48:38Z) - Emergence of Machine Language: Towards Symbolic Intelligence with Neural Networks [73.94290462239061]
We propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation.
By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language.
arXiv Detail & Related papers (2022-01-14T14:54:58Z) - Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z) - It's not Rocket Science : Interpreting Figurative Language in Narratives [48.84507467131819]
We study the interpretation of two types of non-compositional figurative language: idioms and similes.
Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks.
We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language.
arXiv Detail & Related papers (2021-08-31T21:46:35Z) - Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.