Multimodality and Attention Increase Alignment in Natural Language
Prediction Between Humans and Computational Models
- URL: http://arxiv.org/abs/2308.06035v3
- Date: Tue, 2 Jan 2024 15:33:20 GMT
- Title: Multimodality and Attention Increase Alignment in Natural Language
Prediction Between Humans and Computational Models
- Authors: Viktor Kewenig, Andrew Lampinen, Samuel A. Nastase, Christopher
Edwards, Quitterie Lacome d'Estalenx, Akilles Rechardt, Jeremy I. Skipper and
Gabriella Vigliocco
- Abstract summary: Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words.
Multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities.
We show that predictability estimates from humans aligned more closely with scores generated from multimodal models vs. their unimodal counterparts.
- Score: 0.8139163264824348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The potential of multimodal generative artificial intelligence (mAI) to
replicate human grounded language understanding, including the pragmatic,
context-rich aspects of communication, remains to be clarified. Humans are
known to use salient multimodal features, such as visual cues, to facilitate
the processing of upcoming words. Correspondingly, multimodal computational
models can integrate visual and linguistic data using a visual attention
mechanism to assign next-word probabilities. To test whether these processes
align, we tasked both human participants (N = 200) and several
state-of-the-art computational models with evaluating the predictability of
forthcoming words after viewing short audio-only or audio-visual clips with
speech. During the task, the model's attention weights were recorded and human
attention was indexed via eye tracking. Results show that predictability
estimates from humans aligned more closely with scores generated from
multimodal models vs. their unimodal counterparts. Furthermore, including an
attention mechanism doubled alignment with human judgments when visual and
linguistic context facilitated predictions. In these cases, the model's
attention patches and human eye tracking significantly overlapped. Our results
indicate that improved modeling of naturalistic language processing in mAI does
not merely depend on training diet but can be driven by multimodality in
combination with attention-based architectures. Humans and computational models
alike can leverage the predictive constraints of multimodal information by
attending to relevant features in the input.
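To give some intuition for the mechanism the abstract describes, here is a minimal, hypothetical sketch of how a model might fuse visual patch features with a linguistic context via cross-attention and assign next-word probabilities. This is not the paper's actual architecture; the class name, dimensions, and the assumption of a frozen 512-dimensional patch encoder are all illustrative.

```python
import torch
import torch.nn as nn

class CrossModalNextWordPredictor(nn.Module):
    """Toy next-word predictor: text tokens attend over visual patch features."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Visual patches (e.g., from a frozen image encoder) projected to d_model.
        self.visual_proj = nn.Linear(512, d_model)
        # Cross-attention: text tokens as queries, visual patches as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, patch_feats):
        # token_ids: (B, T); patch_feats: (B, P, 512)
        txt = self.embed(token_ids)          # (B, T, d_model)
        vis = self.visual_proj(patch_feats)  # (B, P, d_model)
        fused, attn = self.cross_attn(txt, vis, vis,
                                      need_weights=True,
                                      average_attn_weights=True)
        logits = self.lm_head(fused[:, -1])  # predict the word after the last token
        probs = logits.softmax(dim=-1)       # next-word probability distribution
        # attn[:, -1] shows which patches the final token attended to.
        return probs, attn[:, -1]

model = CrossModalNextWordPredictor()
tokens = torch.randint(0, 1000, (1, 12))  # 12-token transcript so far
patches = torch.randn(1, 49, 512)         # 7x7 grid of visual patch features
next_word_probs, patch_attention = model(tokens, patches)
print(next_word_probs.shape, patch_attention.shape)  # (1, 1000), (1, 49)
```

The returned attention weights over patches are the model-side analogue of the human gaze data described in the abstract.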
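The reported overlap between model attention patches and human eye tracking can be quantified in several ways. One simple approach, assumed here for illustration and not necessarily the paper's metric, is to bin eye-tracking fixations into the model's patch grid and rank-correlate the two distributions:

```python
import numpy as np
from scipy.stats import spearmanr

def gaze_heatmap(fixations, grid=(7, 7), frame=(224, 224)):
    """Bin (x, y) eye-tracking fixations into a patch grid and normalize."""
    heat = np.zeros(grid)
    for x, y in fixations:
        row = min(int(y / frame[1] * grid[0]), grid[0] - 1)
        col = min(int(x / frame[0] * grid[1]), grid[1] - 1)
        heat[row, col] += 1
    return heat / max(heat.sum(), 1)

def attention_gaze_overlap(attn_patches, fixations):
    """Rank-correlate model attention over patches with human gaze density."""
    heat = gaze_heatmap(fixations, grid=attn_patches.shape)
    rho, p = spearmanr(attn_patches.ravel(), heat.ravel())
    return rho, p

# Hypothetical inputs: a 7x7 model attention map and a few fixation points.
attn = np.random.dirichlet(np.ones(49)).reshape(7, 7)
fix = [(30, 40), (35, 45), (150, 100), (160, 110), (200, 60)]
rho, p = attention_gaze_overlap(attn, fix)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

A rank correlation is robust to the different scales of attention weights and fixation counts; alternatives such as normalized scanpath saliency or histogram intersection would serve the same purpose.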
Related papers
- Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z)
- Are Human Conversations Special? A Large Language Model Perspective [8.623471682333964]
This study analyzes changes in the attention mechanisms of large language models (LLMs) when used to understand natural conversations between humans (human-human).
Our findings reveal that while language models exhibit domain-specific attention behaviors, there is a significant gap in their ability to specialize in human conversations.
arXiv Detail & Related papers (2024-03-08T04:44:25Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- A Comparative Study on Textual Saliency of Styles from Eye Tracking, Annotations, and Language Models [21.190423578990824]
We present eyeStyliency, an eye-tracking dataset for human processing of stylistic text.
We develop a variety of methods to derive style saliency scores over text using the collected eye dataset.
We find that while eye-tracking data is unique, it also intersects with both human annotations and model-based importance scores.
arXiv Detail & Related papers (2022-12-19T21:50:36Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)