Adversarially Probing Cross-Family Sound Symbolism in 27 Languages
- URL: http://arxiv.org/abs/2512.12245v1
- Date: Sat, 13 Dec 2025 09:06:50 GMT
- Title: Adversarially Probing Cross-Family Sound Symbolism in 27 Languages
- Authors: Anika Sharma, Tianyi Niu, Emma Wrenn, Shashank Srivastava
- Abstract summary: We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We find that phonological form predicts size semantics above chance even across unrelated languages. To probe beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal.
- Score: 8.003991476447572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.
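The bag-of-segment setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not name the classifier, so a multinomial naive Bayes model stands in as one simple interpretable choice, and the transcriptions in `TOY_DATA` are invented placeholder forms, not items from the released dataset. The adversarial scrubber is not sketched.

```python
from collections import Counter
import math

# Invented toy forms (space-separated segments); NOT from the paper's dataset.
TOY_DATA = [
    ("k i t i", "small"),
    ("m i n i", "small"),
    ("t i p i", "small"),
    ("b o m o", "large"),
    ("g u b u", "large"),
    ("d o g o", "large"),
]


def bag_of_segments(transcription):
    """Count each segment in a space-separated transcription, ignoring order."""
    return Counter(transcription.split())


def train_naive_bayes(examples, alpha=1.0):
    """Fit a multinomial naive Bayes model on (transcription, label) pairs."""
    vocab = set()
    segment_counts = {}
    label_counts = Counter()
    for form, label in examples:
        bag = bag_of_segments(form)
        vocab.update(bag)
        segment_counts.setdefault(label, Counter()).update(bag)
        label_counts[label] += 1
    return {
        "vocab": vocab,
        "segments": segment_counts,
        "labels": label_counts,
        "alpha": alpha,
    }


def predict(model, transcription):
    """Return the label with the highest smoothed log-probability."""
    total_examples = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_examples in model["labels"].items():
        counts = model["segments"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(n_examples / total_examples)
        for segment, k in bag_of_segments(transcription).items():
            if segment in model["vocab"]:  # skip segments never seen in training
                score += k * math.log((counts[segment] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TOY_DATA)
print(predict(model, "p i k i"), predict(model, "b u d o"))
```

Because the features are plain segment counts, the per-class smoothed segment probabilities can be inspected directly, which is the sense in which such a classifier is interpretable.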
Related papers
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling [1.7723990552388866]
This study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones. The results show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.
arXiv Detail & Related papers (2025-03-29T17:39:55Z)
- Kiki or Bouba? Sound Symbolism in Vision-and-Language Models [13.300199242824934]
We show that sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion.
Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools.
arXiv Detail & Related papers (2023-10-25T17:15:55Z)
- Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness [6.790979602996742]
Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings.
We showcase curation procedures which result in a dataset covering 142 languages across 21 language families across the world.
The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features.
arXiv Detail & Related papers (2023-06-05T07:32:21Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- Evaluating Models of Robust Word Recognition with Serial Reproduction [8.17947290421835]
We compare several broad-coverage probabilistic generative language models in their ability to capture human linguistic expectations.
We find that those models that make use of abstract representations of preceding linguistic context best predict the changes made by people in the course of serial reproduction.
arXiv Detail & Related papers (2021-01-24T20:16:12Z)
- Deciphering Undersegmented Ancient Scripts Using Phonetic Prior [31.707254394215283]
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges.
We propose a model that handles both of these challenges by building on rich linguistic constraints.
We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian).
arXiv Detail & Related papers (2020-10-21T15:03:52Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Phonotactic Complexity and its Trade-offs [73.10961848460613]
This simple measure allows us to compare the entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
arXiv Detail & Related papers (2020-05-07T21:36:59Z)