StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in
Texts
- URL: http://arxiv.org/abs/2204.08292v1
- Date: Mon, 18 Apr 2022 12:46:46 GMT
- Title: StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in
Texts
- Authors: Zhengxiang Shi, Qiang Zhang, Aldo Lipani
- Abstract summary: In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts.
We also propose a Tensor-Product based Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks.
- Score: 12.254118455438535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring spatial relations in natural language is a crucial ability an
intelligent system should possess. The bAbI dataset tries to capture tasks
relevant to this domain (tasks 17 and 19). However, these tasks have several
limitations. Most importantly, they are limited to fixed expressions, they are
limited in the number of reasoning steps required to solve them, and they fail
to test the robustness of models to input that contains irrelevant or redundant
information. In this paper, we present a new Question-Answering dataset called
StepGame for robust multi-hop spatial reasoning in texts. Our experiments
demonstrate that state-of-the-art models on the bAbI dataset struggle on the
StepGame dataset. Moreover, we propose a Tensor-Product based Memory-Augmented
Neural Network (TP-MANN) specialized for spatial reasoning tasks. Experimental
results on both datasets show that our model outperforms all the baselines with
superior generalization and robustness performance.
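
The abstract describes the two properties StepGame is built around: answers require chaining k relative spatial relations (multi-hop), and stories are padded with irrelevant or redundant sentences to test robustness. Below is a minimal sketch of how such an example can be generated; it illustrates the task format rather than the authors' released generator, and the eight-way relation vocabulary, the distractor entities, and the "overlap" fallback label are assumptions.

    import random

    # Hypothetical 8-way relation vocabulary mapped to unit grid offsets.
    RELATIONS = {
        "above": (0, 1), "below": (0, -1), "left": (-1, 0), "right": (1, 0),
        "upper-left": (-1, 1), "upper-right": (1, 1),
        "lower-left": (-1, -1), "lower-right": (1, -1),
    }
    OFFSET_TO_NAME = {v: k for k, v in RELATIONS.items()}

    def make_example(k, noise=2, rng=random):
        """Build a k-hop chain of relations plus `noise` irrelevant sentences."""
        entities = [chr(ord("A") + i) for i in range(k + 1)]  # chain A, B, C, ...
        story, pos = [], {entities[0]: (0, 0)}
        for head, tail in zip(entities, entities[1:]):
            rel = rng.choice(list(RELATIONS))
            dx, dy = RELATIONS[rel]
            x, y = pos[head]
            pos[tail] = (x + dx, y + dy)  # place tail relative to head
            story.append(f"{tail} is {rel} of {head}.")
        for _ in range(noise):  # distractors about entities off the chain (k <= 10)
            a, b = rng.sample("PQRSTUVW", 2)
            story.append(f"{a} is {rng.choice(list(RELATIONS))} of {b}.")
        rng.shuffle(story)  # sentence order no longer follows reasoning order
        question = f"What is the relation of {entities[-1]} to {entities[0]}?"
        # Gold label: the sign of the displacement composed over all k hops.
        tx, ty = pos[entities[-1]]
        sign = lambda v: (v > 0) - (v < 0)
        answer = OFFSET_TO_NAME.get((sign(tx), sign(ty)), "overlap")
        return story, question, answer

Shuffling the sentences is what makes the k-hop setting hard: a model cannot read the answer off locally but must recover the chain and compose all k relations.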
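The listing names TP-MANN but gives no architectural detail. For background only, the following sketch shows the tensor-product representation idea the name points at: a fact is written to memory as the outer product of a role vector and a filler vector, and contracting the memory with a role recovers its filler. This is generic tensor-product machinery under assumed dimensions, not a reconstruction of the authors' model.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    # Orthonormal role vectors make unbinding exact; fillers carry the content.
    roles = np.linalg.qr(rng.standard_normal((d, d)))[0][:3]  # 3 role vectors
    fillers = rng.standard_normal((3, d))

    # Write: memory accumulates one rank-1 binding (outer product) per fact.
    M = sum(np.outer(r, f) for r, f in zip(roles, fillers))

    # Read: contracting the memory with a role recovers the bound filler.
    recovered = roles[1] @ M
    print(np.allclose(recovered, fillers[1]))  # True, since roles are orthonormal

Because bindings are superposed by addition, one fixed-size memory holds several role-filler facts at once, which makes the representation a natural fit for multi-step relational updates.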
Related papers
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore using less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
- What's "up" with vision-language models? Investigating their struggle with spatial reasoning [76.2406963762722]
Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
arXiv Detail & Related papers (2023-10-30T17:50:15Z)
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning [10.810615375345511]
This paper proposes a benchmark for spatial reasoning on natural language text.
We design a grammar and reasoning rules to automatically generate spatial descriptions of visual scenes and corresponding QA pairs.
Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs' capability on spatial understanding.
arXiv Detail & Related papers (2021-04-12T21:37:18Z)
- Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement in performance on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)