Does Conceptual Representation Require Embodiment? Insights From Large Language Models
- URL: http://arxiv.org/abs/2305.19103v3
- Date: Fri, 1 Dec 2023 13:25:03 GMT
- Title: Does Conceptual Representation Require Embodiment? Insights From Large Language Models
- Authors: Qihui Xu, Yingying Peng, Samuel A. Nastase, Martin Chodorow, Minghua
Wu, and Ping Li
- Abstract summary: We compare representations of 4,442 lexical concepts between humans and ChatGPTs (GPT-3.5 and GPT-4).
We identify two main findings: 1) Both models strongly align with human representations in non-sensorimotor domains but lag in sensory and motor areas, with GPT-4 outperforming GPT-3.5; 2) GPT-4's gains are associated with its additional visual learning, which also appears to benefit related dimensions like haptics and imageability.
- Score: 9.390117546307042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To what extent can language alone give rise to complex concepts, or is
embodied experience essential? Recent advancements in large language models
(LLMs) offer fresh perspectives on this question. Although LLMs are trained on
restricted modalities, they exhibit human-like performance in diverse
psychological tasks. Our study compared representations of 4,442 lexical
concepts between humans and ChatGPTs (GPT-3.5 and GPT-4) across multiple
dimensions, including five key domains: emotion, salience, mental
visualization, sensory, and motor experience. We identify two main findings: 1)
Both models strongly align with human representations in non-sensorimotor
domains but lag in sensory and motor areas, with GPT-4 outperforming GPT-3.5;
2) GPT-4's gains are associated with its additional visual learning, which also
appears to benefit related dimensions like haptics and imageability. These
results highlight the limitations of language in isolation and suggest that integrating diverse input modalities leads to a more human-like conceptual representation.
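To make the comparison concrete, the sketch below shows the kind of dimension-by-dimension correlation analysis the abstract describes: human norms and model-elicited ratings for the same words are compared with a rank correlation per dimension. This is only an illustrative sketch, not the authors' pipeline; the words, dimensions, and rating values are invented placeholders, and the step that actually prompts GPT-3.5/GPT-4 for ratings is omitted.

```python
# Illustrative sketch only (not the authors' code): align a model's word
# ratings with human norms, dimension by dimension. In the study, human
# norms cover 4,442 lexical concepts across emotion, salience, mental
# visualization, sensory, and motor dimensions; here everything is a
# placeholder and the LLM elicitation step is omitted.
import numpy as np
from scipy.stats import spearmanr

words = ["apple", "thunder", "justice", "velvet", "kick"]
dimensions = ["valence", "visual", "haptic", "foot_leg"]

# Hypothetical human norms: rows = words, columns = dimensions (0-5 scales).
human = np.array([
    [4.1, 4.8, 3.9, 0.7],   # apple
    [2.0, 3.5, 1.2, 0.3],   # thunder
    [3.6, 1.4, 0.5, 0.4],   # justice
    [3.9, 3.8, 4.6, 0.2],   # velvet
    [2.8, 3.9, 2.7, 4.5],   # kick
])

# Hypothetical model ratings, as if elicited by giving an LLM the same
# rating instructions that human participants received.
model = np.array([
    [4.3, 4.5, 3.1, 0.9],
    [2.2, 3.0, 0.8, 0.5],
    [3.4, 1.9, 0.9, 0.2],
    [3.7, 3.2, 3.8, 0.6],
    [2.9, 3.3, 2.1, 3.9],
])

# Human-model alignment per dimension: Spearman rank correlation across words.
for j, dim in enumerate(dimensions):
    rho, p = spearmanr(human[:, j], model[:, j])
    print(f"{dim:>8}: rho = {rho:+.2f} (p = {p:.3f})")
```

Under the pattern reported above, one would expect high correlations on non-sensorimotor dimensions such as valence and noticeably weaker ones on sensory and motor dimensions such as haptic or foot/leg action.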
Related papers
- Nonverbal Interaction Detection [83.40522919429337]
This work addresses a new challenge of understanding human nonverbal interaction in social contexts.
We contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups.
Second, we establish a new task, NVI-DET, for nonverbal interaction detection, which is formalized as identifying triplets of the form <individual, group, interaction> from images.
Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs.
arXiv Detail & Related papers (2024-07-11T02:14:06Z)
- Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- Exploring Spatial Schema Intuitions in Large Language and Vision Models [8.944921398608063]
We investigate whether large language models (LLMs) effectively capture implicit human intuitions about the building blocks of language.
Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences.
This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and computations made by large language models.
arXiv Detail & Related papers (2024-02-01T19:25:50Z)
- Human vs. LMMs: Exploring the Discrepancy in Emoji Interpretation and Usage in Digital Communication [68.40865217231695]
This study examines the behavior of GPT-4V in replicating human-like use of emojis.
The findings reveal a discernible discrepancy between human and GPT-4V behaviors, likely due to the subjective nature of human interpretation.
arXiv Detail & Related papers (2024-01-16T08:56:52Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- What's Next in Affective Modeling? Large Language Models [3.0902630634005797]
GPT-4 performs well across multiple emotion tasks.
It can distinguish emotion theories and come up with emotional stories.
We suggest that LLMs could play an important role in affective modeling.
arXiv Detail & Related papers (2023-10-03T16:39:20Z)
- Fine-grained Affective Processing Capabilities Emerging from Large Language Models [7.17010996725842]
We explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone.
We show that ChatGPT a) performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, b) has meaningful emotion representations in terms of emotion categories, and c) can perform basic appraisal-based emotion elicitation of situations.
arXiv Detail & Related papers (2023-09-04T15:32:47Z)
- Large language models predict human sensory judgments across six modalities [12.914521751805658]
We show that state-of-the-art large language models can unlock new insights into the problem of recovering the perceptual world from language.
We elicit pairwise similarity judgments from GPT models across six psychophysical datasets.
We show that the judgments are significantly correlated with human data across all domains, recovering well-known representations such as the color wheel and pitch spiral (a toy sketch of this kind of recovery appears after this list).
arXiv Detail & Related papers (2023-02-02T18:32:46Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
arXiv Detail & Related papers (2022-04-18T19:39:36Z)
- Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)
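As a companion to the entry "Large language models predict human sensory judgments across six modalities" above, here is a toy sketch of how pairwise similarity judgments, whether collected from people or elicited from a GPT model, can be embedded into a low-dimensional map that exposes structure such as the color wheel. The similarity matrix below is synthetic, and the code is an assumption-laden illustration rather than code from that paper.

```python
# Toy illustration (synthetic data, not from the cited paper): recover a
# "color wheel" arrangement from pairwise similarity judgments over color
# terms using multidimensional scaling (MDS).
import numpy as np
from sklearn.manifold import MDS

colors = ["red", "orange", "yellow", "green", "blue", "purple"]
n = len(colors)

# Synthetic similarities: hues that are close on the wheel are rated as
# more similar (1 = identical, 0 = maximally different).
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
similarity = (1 + np.cos(angles[:, None] - angles[None, :])) / 2

# MDS expects dissimilarities; embed them in two dimensions.
dissimilarity = 1 - similarity
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

# For a circular similarity structure, the embedded points trace out a ring.
for name, (x, y) in zip(colors, coords):
    print(f"{name:>7}: ({x:+.2f}, {y:+.2f})")
```

With real judgments, the same pipeline is typically preceded by prompting the model for a similarity rating for every pair of terms and accompanied by correlating the model's ratings with human ratings of the same pairs.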
This list is automatically generated from the titles and abstracts of the papers on this site.