Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation
- URL: http://arxiv.org/abs/2504.12805v1
- Date: Thu, 17 Apr 2025 10:10:25 GMT
- Title: Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation
- Authors: Takaya Arita, Wenxian Zheng, Reiji Suzuki, Fuminori Akiba
- Abstract summary: This study explores how large language models (LLMs) perform in two areas related to art. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. These critiques were compared with those written by human experts in a Turing test-style evaluation. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension.
- Score: 0.9428222284377783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox--the idea that LLMs can produce expert-like output without genuine understanding--they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.
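As a concrete illustration of the step-by-step prompting process described in the abstract (a full-length critique first, then shorter, more coherent rewrites), here is a minimal Python sketch. The `call_llm` stub, the prompt wording, and the word-count targets are illustrative assumptions for this sketch, not the authors' actual prompts or pipeline.

```python
from typing import List, Sequence


def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat/completion client is used.

    Swap the body for a real API call; here it just echoes a stub so the
    sketch runs without network access or credentials.
    """
    return f"[model response to: {prompt[:60]}...]"


def generate_critiques(artwork_description: str,
                       shorter_lengths: Sequence[int] = (600, 300)) -> List[str]:
    """Stage 1: prompt for a full-length critique guided by an evaluative
    framework. Stage 2: ask for progressively shorter, more coherent rewrites
    of the previous version, mirroring the step-by-step prompting idea."""
    framework_hint = (
        "Write an art critique that describes, interprets, and evaluates the "
        "work, in the spirit of Noel Carroll's evaluative framework."
    )
    full_critique = call_llm(
        f"{framework_hint}\n\nArtwork: {artwork_description}\n\n"
        "Write a full-length critique."
    )
    critiques = [full_critique]
    for target in shorter_lengths:
        condensed = call_llm(
            "Rewrite the following critique as a more coherent version of "
            f"about {target} words, keeping its main interpretive claims.\n\n"
            f"{critiques[-1]}"
        )
        critiques.append(condensed)
    return critiques


if __name__ == "__main__":
    for i, version in enumerate(generate_critiques("An oil painting of a stormy harbor.")):
        print(f"--- critique version {i} ---\n{version}\n")
```

In the same spirit, the Turing test-style comparison would simply present pairs of AI-generated and expert-written critiques to human judges; that evaluation step is not sketched here.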
Related papers
- How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs [13.822169295436177]
We investigate how large language models (LLMs) process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding.
arXiv Detail & Related papers (2025-07-18T18:28:35Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - Towards Quantifying Commonsense Reasoning with Mechanistic Insights [7.124379028448955]
We argue that a proxy of commonsense reasoning can be maintained as a graphical structure. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries.
arXiv Detail & Related papers (2025-04-14T10:21:59Z) - When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning? [17.647896474008597]
We introduce a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts. We systematically evaluate a wide range of vision-language models through four complementary tasks. Our experiments reveal that even the most advanced models significantly underperform compared to humans.
arXiv Detail & Related papers (2025-03-29T16:08:51Z) - How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition [75.11808682808065]
This study investigates whether large language models (LLMs) exhibit similar tendencies in understanding semantic size. Our findings reveal that multi-modal training is crucial for LLMs to achieve more human-like understanding. Lastly, we examine whether LLMs are influenced by attention-grabbing headlines with larger semantic sizes in a real-world web shopping scenario.
arXiv Detail & Related papers (2025-03-01T03:35:56Z) - The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters [67.61587661660852]
Theory-of-Mind (ToM) allows humans to understand and interpret the mental states of others. In this paper, we verify the importance of comprehensive contextual understanding about personal backgrounds in ToM. We introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels.
arXiv Detail & Related papers (2025-01-03T09:04:45Z) - Understanding the Dark Side of LLMs' Intrinsic Self-Correction [55.51468462722138]
Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability.
Recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts.
We identify that intrinsic self-correction can cause LLMs to waver on both intermediate and final answers and can lead to prompt bias on simple factual questions.
arXiv Detail & Related papers (2024-12-19T15:39:31Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - Meaningful Learning: Enhancing Abstract Reasoning in Large Language Models via Generic Fact Guidance [38.49506722997423]
Large language models (LLMs) have demonstrated impressive performance and strong explainability across various reasoning scenarios.
However, LLMs often struggle to abstract and apply generic facts to provide consistent and precise answers.
This has sparked a vigorous debate about whether LLMs are genuinely reasoning or merely memorizing.
arXiv Detail & Related papers (2024-03-14T04:06:13Z) - Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform tasks by cheating with answers memorized from the pretraining corpus or via a multi-step reasoning mechanism.
We show that MechanisticProbe is able to detect information about the reasoning tree from the model's attention for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z) - Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations [14.685170467182369]
Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks.
Since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response.
arXiv Detail & Related papers (2023-10-17T12:34:32Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" manner and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z) - The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.