CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
- URL: http://arxiv.org/abs/2201.05320v1
- Date: Fri, 14 Jan 2022 06:49:15 GMT
- Title: CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
- Authors: Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav
Goldberg, Yejin Choi, Jonathan Berant
- Abstract summary: Constructing benchmarks that test the abilities of modern natural language understanding models is difficult.
In this work, we propose gamification as a framework for data construction.
- Score: 126.85096257968414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constructing benchmarks that test the abilities of modern natural language
understanding models is difficult - pre-trained language models exploit
artifacts in benchmarks to achieve human parity, but still fail on adversarial
examples and make errors that demonstrate a lack of common sense. In this work,
we propose gamification as a framework for data construction. The goal of
players in the game is to compose questions that mislead a rival AI while using
specific phrases for extra points. The game environment leads to enhanced user
engagement and simultaneously gives the game designer control over the
collected data, allowing us to collect high-quality data at scale. Using our
method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and
demonstrate its difficulty for models that are orders-of-magnitude larger than
the AI used in the game itself. Our best baseline, the T5-based Unicorn with
11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3
(52.9%) in a few-shot inference setup. Both score well below human performance,
which is 94.1%.
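The abstract describes evaluating yes/no questions with a fine-tuned baseline and with GPT-3 in a few-shot setup. As a rough illustration only, the Python sketch below shows how such a few-shot yes/no evaluation loop could be wired up; the prompt format, the in-context examples, and the query_model callable are illustrative assumptions, not the paper's actual evaluation code.

from typing import Callable, Iterable, Tuple

# Hypothetical in-context examples; the real few-shot prompts used in the
# paper are not reproduced here.
FEW_SHOT_EXAMPLES = [
    ("A pound of feathers weighs more than a pound of bricks.", "no"),
    ("Most people have fewer than three hands.", "yes"),
]

def build_prompt(question: str) -> str:
    """Prepend a handful of solved yes/no examples, then ask the new question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT_EXAMPLES]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

def evaluate(dataset: Iterable[Tuple[str, str]],
             query_model: Callable[[str], str]) -> float:
    """Return accuracy of a model's yes/no answers.

    dataset: (question, gold label) pairs with gold in {"yes", "no"}.
    query_model: placeholder for whatever API call returns the model's text output.
    """
    correct = total = 0
    for question, gold in dataset:
        prediction = query_model(build_prompt(question)).strip().lower()
        correct += int(prediction.startswith(gold))
        total += 1
    return correct / total if total else 0.0

A fine-tuned baseline such as the Unicorn model reported above would instead be evaluated without the few-shot prompt, directly predicting a label for each question.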
Related papers
- From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI [0.0]
We study the effectiveness of large language models (LLMs) on different question answering tasks.
We demonstrate the ability of LLMs to reason with commonsense, as the models outperform humans on different datasets.
Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent".
arXiv Detail & Related papers (2024-07-04T09:38:49Z)
- ChatGPT Rates Natural Language Explanation Quality Like Humans: But on Which Scales? [7.307538454513983]
This study explores the alignment between ChatGPT and human assessments across multiple scales.
We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores.
Our results show that ChatGPT aligns better with humans in more coarse-grained scales.
arXiv Detail & Related papers (2024-03-26T04:07:08Z)
- Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE [203.65227947509933]
This report describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard.
SuperGLUE is more challenging than the widely used General Language Understanding Evaluation (GLUE) benchmark, containing eight difficult language understanding tasks.
arXiv Detail & Related papers (2022-12-04T15:36:18Z)
- Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors [3.39487428163997]
We show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game.
Our results are promising for employing language models to detect video game bugs.
arXiv Detail & Related papers (2022-10-05T18:44:35Z)
- WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
- Teaching Broad Reasoning Skills via Decomposition-Guided Contexts [50.114651561111245]
Question-answering datasets require a broad set of reasoning skills.
We show how to use question decompositions to teach these broad reasoning skills in a robust fashion.
arXiv Detail & Related papers (2022-05-25T05:13:21Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
- TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)