Related papers: Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

URL: http://arxiv.org/abs/2209.07760v1
Date: Fri, 16 Sep 2022 07:38:51 GMT
Title: Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
Authors: Mana Ashida, Saku Sugawara
Abstract summary: This study frames this task by asking multiple questions with the same set of possible endings as candidate answers. Our dataset consists of more than 4.5K questions over 1.3K story texts in English.
Score: 8.553766123004682
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The possible consequences for the same context may vary depending on the situation we refer to. However, current studies in natural language processing do not focus on situated commonsense reasoning under multiple possible scenarios. This study frames this task by asking multiple questions with the same set of possible endings as candidate answers, given a short story text. Our resulting dataset, Possible Stories, consists of more than 4.5K questions over 1.3K story texts in English. We discover that even current strong pretrained language models struggle to answer the questions consistently, highlighting that the highest accuracy in an unsupervised setting (60.2%) is far behind human accuracy (92.5%). Through a comparison with existing datasets, we observe that the questions in our dataset contain minimal annotation artifacts in the answer options. In addition, our dataset includes examples that require counterfactual reasoning, as well as those requiring readers' reactions and fictional information, suggesting that our dataset can serve as a challenging testbed for future studies on situated commonsense reasoning.

Related papers

TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa [13.107551474252379]
We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa.<n>The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer.
arXiv Detail & Related papers (2025-07-23T17:20:28Z)
Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations [3.3018718917393297]
We propose the novel task of Question Answering with source citation in ambiguous settings, where multiple valid answers exist. We create a comprehensive framework consisting of: (1) five novel datasets; (2) the first ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; and (3) two new metrics to evaluate models' performances. We hope that this new task, datasets, metrics, and baselines will inspire the community to push the boundaries of QA research and develop more trustworthy and interpretable systems.
arXiv Detail & Related papers (2024-10-05T17:37:01Z)
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA) We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
Qsnail: A Questionnaire Dataset for Sequential Question Generation [76.616068047362]
We present the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires. We conduct experiments on Qsnail, and the results reveal that retrieval models and traditional generative models do not fully align with the given research topic and intents. Despite enhancements through the chain-of-thought prompt and finetuning, questionnaires generated by language models still fall short of human-written questionnaires.
arXiv Detail & Related papers (2024-02-22T04:14:10Z)
Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension [13.896697187967547]
We crowdsource rationale texts that explain why we should select or eliminate answer options from a logical reading comprehension dataset. Experiments show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even if they are able to answer the main questions correctly.
arXiv Detail & Related papers (2023-11-30T08:44:55Z)
Zero-shot Clarifying Question Generation for Conversational Search [25.514678546942754]
We propose a constrained clarifying question generation system which uses both question templates and query facets to guide the effective and precise question generation. Experiment results show that our method outperforms existing state-of-the-art zero-shot baselines by a large margin.
arXiv Detail & Related papers (2023-01-30T04:43:02Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances. Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers. We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval [46.3246135936476]
We analyze why answering information-seeking queries is more challenging and where their prevalent unanswerabilities arise. Our controlled experiments suggest two headrooms -- paragraph selection and answerability prediction. We manually annotate 800 unanswerable examples across six languages on what makes them challenging to answer.
arXiv Detail & Related papers (2020-10-22T17:48:17Z)
Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document. We show that readers engage in a series of pragmatic strategies to seek information. We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.