JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions
- URL: http://arxiv.org/abs/2210.15456v2
- Date: Fri, 26 May 2023 05:40:19 GMT
- Title: JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions
- Authors: Mo Yu, Yi Gu, Xiaoxiao Guo, Yufei Feng, Xiaodan Zhu, Michael
Greenspan, Murray Campbell, Chuang Gan
- Abstract summary: We propose a new commonsense reasoning dataset based on human players' Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging for previous machine reading models as well as recent large language models.
- Score: 75.42526766746515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonsense reasoning simulates the human ability to make presumptions about
our physical world, and it is an essential cornerstone in building general AI
systems. We propose a new commonsense reasoning dataset based on human players'
Interactive Fiction (IF) gameplay walkthroughs, since players demonstrate
plentiful and diverse commonsense reasoning during play. The new dataset provides a natural
mixture of various reasoning types and requires multi-hop reasoning. Moreover,
the IF game-based construction procedure requires far less human intervention
than previous approaches. Unlike existing benchmarks, our dataset focuses on
the assessment of functional commonsense knowledge rules rather than factual
knowledge. Hence, in order to achieve higher performance on our tasks, models
need to effectively utilize such functional knowledge to infer the outcomes of
actions, rather than relying solely on memorized facts. Experiments show that
the introduced dataset is challenging for previous machine reading models as
well as recent large language models, with a significant 20% performance gap
relative to human experts.
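To make the task format concrete, below is a minimal sketch of what a next-observation-prediction instance and a toy evaluation loop could look like. The field names, texts, and evaluation loop are illustrative assumptions, not the dataset's actual schema or evaluation code.

```python
# Hypothetical illustration of a JECC-style instance: given the previous
# game observation and the player's action, a model must select the most
# plausible next observation from a small candidate set. Field names and
# contents are assumptions made for illustration, not the dataset's schema.
example = {
    "observation": "You are in the kitchen. A locked wooden door leads north. "
                   "A brass key lies on the counter.",
    "action": "unlock door with brass key",
    "candidates": [
        "The door is already open.",
        "The key does not fit the lock.",
        "The lock clicks open; the door can now be opened.",
        "You pick up the brass key.",
    ],
    "label": 2,  # index of the correct next observation
}

def accuracy(predict, dataset):
    """Toy evaluation loop: `predict` maps an instance to a candidate index."""
    return sum(predict(ex) == ex["label"] for ex in dataset) / len(dataset)

# A trivial baseline that always guesses the first candidate.
print(accuracy(lambda ex: 0, [example]))  # 0.0 on this single toy instance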
Related papers
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal, multi-turn reasoning, providing instance-level experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models [22.0839948292609]
We introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern language models.
This dataset is constructed by infusing original questions with counterfactual presuppositions of various types, such as numerical and counter-language queries (a toy sketch of this infusion follows this entry).
Our evaluations of contemporary vision-language models using this dataset reveal substantial performance drops, with some models showing up to a 40% decrease.
arXiv Detail & Related papers (2023-10-10T13:45:59Z)
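As referenced above, here is a rough sketch of what such counterfactual infusion might look like for a numerical query. The template, function name, and figures are hypothetical and do not come from C-VQA's actual construction pipeline.

```python
def infuse_numerical_counterfactual(question, answer, delta):
    """Rewrite a counting question with a counterfactual premise (toy template)."""
    verb = "added" if delta > 0 else "removed"
    premise = f"If {abs(delta)} of them were {verb}, "
    # Lowercase the original question's first letter and prepend the premise;
    # shift the ground-truth answer accordingly.
    return premise + question[0].lower() + question[1:], answer + delta

q, a = "How many apples are on the table?", 3
cf_q, cf_a = infuse_numerical_counterfactual(q, a, -2)
print(cf_q)  # If 2 of them were removed, how many apples are on the table?
print(cf_a)  # 1
```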
- Towards A Unified Agent with Foundation Models [18.558328028366816]
We investigate how to embed and leverage the abilities of language and vision foundation models in Reinforcement Learning (RL) agents.
We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges.
We demonstrate substantial performance improvements over baselines in exploration efficiency and the ability to reuse data from offline datasets.
arXiv Detail & Related papers (2023-07-18T22:37:30Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations (a toy illustration follows this entry).
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
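To illustrate what a chain of numerical reasoning over conversation turns can look like, here is a toy sketch. The figures and turn structure are invented for illustration and are not taken from ConvFinQA itself.

```python
# Toy illustration of a conversational numerical-reasoning chain in the
# spirit of ConvFinQA: later turns must reuse values resolved in earlier
# turns. All figures and the turn structure are invented examples.
context = {"revenue_2020": 100.0, "revenue_2021": 120.0}
answers = []  # answers[k] holds the value resolved at turn k

# Turns 0 and 1: direct lookups from the (hypothetical) financial report.
answers.append(context["revenue_2020"])  # "What was the revenue in 2020?"
answers.append(context["revenue_2021"])  # "And in 2021?"

# Turn 2 chains the two earlier answers:
# growth rate = (revenue_2021 - revenue_2020) / revenue_2020
answers.append((answers[1] - answers[0]) / answers[0])
print(f"Growth rate: {answers[2]:.0%}")  # Growth rate: 20%
```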
- A Benchmark for Compositional Visual Reasoning [5.576460160219606]
We introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards more data-efficient learning algorithms.
We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and associated image datasets at scale.
Our proposed benchmark includes measures of sample efficiency, generalization and transfer across task rules, as well as the ability to leverage compositionality.
arXiv Detail & Related papers (2022-06-11T00:04:49Z)
- Deriving Commonsense Inference Tasks from Interactive Fictions [44.15655034882293]
We propose a new commonsense reasoning dataset based on human players' interactive fiction gameplay records.
Experiments show that our task is solvable by human experts with sufficient commonsense knowledge but poses challenges to existing machine reading models.
arXiv Detail & Related papers (2020-10-19T19:02:34Z)
- LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning [20.81312285957089]
We build a comprehensive dataset, named LogiQA, which is sourced from expert-written questions for testing human logical reasoning.
Results show that state-of-the-art neural models perform far below the human ceiling.
Our dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
arXiv Detail & Related papers (2020-07-16T05:52:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.