Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?
- URL: http://arxiv.org/abs/2504.00942v1
- Date: Tue, 01 Apr 2025 16:28:38 GMT
- Title: Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?
- Authors: Anna Bavaresco, Raquel Fernández
- Abstract summary: A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models. We compare word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects.
- Score: 5.412335160966597
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio -- similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information -- as defined by an existing norm-based 'experiential model' -- and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.
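As a rough illustration of the kind of brain-alignment comparison the abstract describes (a minimal sketch, not the authors' actual pipeline; the array names and the use of representational similarity analysis here are assumptions for illustration only):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_brain_alignment(model_embeddings: np.ndarray,
                        fmri_responses: np.ndarray) -> float:
    """Correlate the pairwise-dissimilarity structure of model word
    embeddings (n_words x n_dims) with that of fMRI response patterns
    (n_words x n_voxels) for the same words -- a standard
    representational similarity analysis (RSA) score."""
    model_rdm = pdist(model_embeddings, metric="cosine")      # model dissimilarities
    brain_rdm = pdist(fmri_responses, metric="correlation")   # neural dissimilarities
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho
```

Under a measure of this kind, the paper's headline finding would correspond to language-only embeddings yielding higher alignment scores than multimodal ones; the study itself may use a different alignment method, so treat this only as a schematic of the evaluation idea.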
Related papers
- Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models [7.511284868070148]
We investigate whether integration of visuo-linguistic information leads to representations that are more aligned with human brain activity.
Our findings indicate an advantage of multimodal models in predicting human brain activations.
arXiv Detail & Related papers (2024-07-25T10:08:37Z)
- DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models on a set of tasks paired with human behavioral data, enabling comparison of models to human language development. These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language) and docks with a language model that plays the role of a universal task layer.
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge [0.13706331473063876]
We introduce two evaluation tasks for measuring visual commonsense knowledge in language models.
We find that visual commonsense knowledge does not differ significantly between multimodal models and unimodal baseline models trained on visual text data.
arXiv Detail & Related papers (2022-05-14T13:37:50Z)
- Considerations for Multilingual Wikipedia Research [1.5736899098702972]
Growing attention to non-English language editions of Wikipedia has led to the inclusion of many more language editions in datasets and models.
This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia.
arXiv Detail & Related papers (2022-04-05T20:34:15Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power-law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution (a rough illustration of the timescale idea follows after this list).
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
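As a hedged illustration of the multi-timescale intuition in the last entry above (a standard back-of-envelope relation, not the derivation from that paper): a single LSTM unit whose forget gate stays near a constant value f in (0, 1) decays stored information as f^t, giving an effective timescale of roughly -1/ln(f), so a spread of forget-gate values yields a spread of timescales.

```python
import numpy as np

def forget_gate_to_timescale(f: np.ndarray) -> np.ndarray:
    """Effective memory timescale of an LSTM unit whose forget gate stays
    near a constant value f in (0, 1): the decay f**t has time constant
    tau = -1 / ln(f)."""
    return -1.0 / np.log(f)

# A spread of forget-gate values gives a spread of timescales, letting one
# network track both short- and long-range dependencies.
gates = np.array([0.5, 0.9, 0.99, 0.999])
print(forget_gate_to_timescale(gates).round(1))  # -> [  1.4   9.5  99.5 999.5]
```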