CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding
- URL: http://arxiv.org/abs/2107.09847v3
- Date: Sun, 19 May 2024 05:37:53 GMT
- Title: CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding
- Authors: Minjung Shin, Seongho Choi, Yu-Jung Heo, Minsu Lee, Byoung-Tak Zhang, Jeh-Kwang Ryu
- Abstract summary: We introduce CogME, a cognition-inspired, multi-dimensional evaluation metric designed for AI models focusing on story understanding.
We argue for the need for metrics based on an understanding of the nature of tasks and designed to align closely with human cognitive processes.
This approach provides insights beyond traditional overall scores and paves the way for more sophisticated AI development targeting higher cognitive functions.
- Score: 19.113385429326808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce CogME, a cognition-inspired, multi-dimensional evaluation metric designed for AI models focusing on story understanding. CogME is a framework grounded in the human thinking strategies and story elements involved in story understanding. By breaking questions down according to these elements, the approach provides a nuanced assessment that reveals not only an AI model's particular strengths and weaknesses but also the characteristics of the benchmark dataset. Our case study with the DramaQA dataset demonstrates a refined analysis of both the model and the benchmark dataset. We argue for the need for metrics based on an understanding of the nature of tasks and designed to align closely with human cognitive processes. This approach provides insights beyond traditional overall scores and paves the way for more sophisticated AI development targeting higher cognitive functions.
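As a minimal illustration of the kind of breakdown CogME enables, the sketch below computes per-dimension accuracy from questions tagged with target story elements and thinking strategies. The tag names and data layout are assumptions for illustration, not CogME's actual taxonomy.

```python
from collections import defaultdict

def dimensional_accuracy(results):
    """Aggregate per-(dimension, tag) accuracy from tagged QA results.

    `results` is a list of dicts such as
    {"correct": True, "tags": {"target": "character", "strategy": "recall"}}.
    The tag taxonomy here is illustrative, not CogME's actual one.
    """
    counts = defaultdict(lambda: [0, 0])  # (dimension, tag) -> [correct, total]
    for r in results:
        for dim, tag in r["tags"].items():
            counts[(dim, tag)][1] += 1
            counts[(dim, tag)][0] += r["correct"]
    return {key: c / n for key, (c, n) in counts.items()}

# A model that recalls facts well but fails on causal reasoning shows up
# clearly in the per-strategy scores even when overall accuracy looks fine.
results = [
    {"correct": True,  "tags": {"target": "character", "strategy": "recall"}},
    {"correct": True,  "tags": {"target": "event",     "strategy": "recall"}},
    {"correct": False, "tags": {"target": "event",     "strategy": "causality"}},
]
print(dimensional_accuracy(results))
```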
Related papers
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- Exposing Assumptions in AI Benchmarks through Cognitive Modelling [0.0]
Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations.
We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models.
arXiv Detail & Related papers (2024-09-25T11:55:02Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data, while intelligence centers on model learning and prediction.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Rethinking Language Models as Symbolic Knowledge Graphs [7.192286645674803]
Symbolic knowledge graphs (KGs) play a pivotal role in knowledge-centric applications such as search, question answering and recommendation.
We construct nine qualitative benchmarks that encompass a spectrum of attributes including symmetry, asymmetry, hierarchy, bidirectionality, compositionality, paths, entity-centricity, bias and ambiguity.
arXiv Detail & Related papers (2023-08-25T21:25:08Z)
- Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present the Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset encompassing an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
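Sentence-level contrastive learning can be pictured as a symmetric InfoNCE loss over matched image/sentence embedding pairs; the sketch below is a generic formulation under that assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(img_emb, sent_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/sentence embeddings.

    img_emb, sent_emb: (batch, dim); row i of each is a matched pair.
    A generic sketch; the paper's formulation may differ.
    """
    img = F.normalize(img_emb, dim=-1)
    sent = F.normalize(sent_emb, dim=-1)
    logits = img @ sent.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(len(logits))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = sentence_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
```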
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
- Deep Graph Memory Networks for Forgetting-Robust Knowledge Tracing [5.648636668261282]
We propose a novel knowledge tracing model, namely Deep Graph Memory Network (DGMN).
In this model, we incorporate a forget gating mechanism into an attention memory structure in order to capture forgetting behaviours.
The model can also learn relationships between latent concepts from a dynamic latent concept graph.
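A toy version of a forget-gated memory update might look like the following; the layer shapes, gating form, and attention over slots are assumptions for illustration rather than DGMN's actual architecture.

```python
import torch
import torch.nn as nn

class ForgetGatedMemory(nn.Module):
    """Toy memory update with a forget gate, in the spirit of a forget-gated
    attention memory. Names and shapes are illustrative assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.forget = nn.Linear(2 * dim, dim)
        self.write = nn.Linear(dim, dim)

    def forward(self, memory, x):
        # memory: (slots, dim); x: (dim,) embedding of the current interaction
        attn = torch.softmax(memory @ x, dim=0)           # attention over slots
        gate_in = torch.cat([memory, x.expand_as(memory)], dim=-1)
        f = torch.sigmoid(self.forget(gate_in))           # per-slot forget gate
        update = attn.unsqueeze(-1) * torch.tanh(self.write(x))
        return f * memory + update                        # decay, then write

memory = torch.zeros(4, 8)
layer = ForgetGatedMemory(8)
memory = layer(memory, torch.randn(8))
```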
arXiv Detail & Related papers (2021-08-18T12:04:10Z)
- KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion [99.47414073164656]
A comprehensive knowledge graph (KG) contains an instance-level entity graph and an ontology-level concept graph.
The two-view KG provides a testbed for models to "simulate" humans' abilities in knowledge abstraction, concretization, and completion.
We propose a unified KG benchmark by improving existing benchmarks in terms of dataset scale, task coverage, and difficulty.
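The two-view structure can be pictured as two triple sets joined by cross-view links; the toy example below (with invented entities) shows how abstraction lifts an instance-level triple to the concept level.

```python
# Toy two-view KG: instance-level triples, concept-level triples, and
# cross-view "instanceOf" links. Entities and relations are illustrative.
entity_graph = [("Seoul", "capitalOf", "South Korea")]
concept_graph = [("city", "subclassOf", "settlement")]
cross_links = [("Seoul", "instanceOf", "city"),
               ("South Korea", "instanceOf", "country")]

# Abstraction: map an instance-level triple onto the concept level.
instance_of = {h: t for h, _, t in cross_links}
abstracted = [(instance_of[h], r, instance_of.get(t, t))
              for h, r, t in entity_graph]
print(abstracted)  # [('city', 'capitalOf', 'country')]
```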
arXiv Detail & Related papers (2020-04-28T16:21:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.