SCDE: Sentence Cloze Dataset with High Quality Distractors From
Examinations
- URL: http://arxiv.org/abs/2004.12934v1
- Date: Mon, 27 Apr 2020 16:48:54 GMT
- Title: SCDE: Sentence Cloze Dataset with High Quality Distractors From
Examinations
- Authors: Xiang Kong, Varun Gangal, Eduard Hovy
- Abstract summary: We introduce SCDE, a dataset to evaluate the performance of computational models through sentence prediction.
SCDE is a human-created sentence cloze dataset, collected from public school English examinations.
- Score: 30.86193649398141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SCDE, a dataset to evaluate the performance of computational
models through sentence prediction. SCDE is a human-created sentence cloze
dataset, collected from public school English examinations. Our task requires a
model to fill up multiple blanks in a passage from a shared candidate set with
distractors designed by English teachers. Experimental results demonstrate that
this task requires the use of non-local, discourse-level context beyond the
immediate sentence neighborhood. The blanks require joint solving and
significantly impair each other's context. Furthermore, through ablations, we
show that the distractors are of high quality and make the task more
challenging. Our experiments show that there is a significant performance gap
between advanced models (72%) and humans (87%), encouraging future models to
bridge this gap.
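The task structure described above (shared candidates, multiple blanks, joint solving) can be sketched as follows. This is a minimal illustration, not the dataset's actual schema: the field names, the toy passage, and the `score_fn` interface are all assumptions for exposition.

```python
from itertools import permutations

# A hypothetical SCDE-style instance: a passage with blanks and a shared
# candidate set mixing the removed sentences with teacher-written distractors.
instance = {
    "passage": [
        "Reading widely builds vocabulary.",
        "<BLANK_1>",
        "It also exposes learners to varied sentence structures.",
        "<BLANK_2>",
    ],
    # Candidates are shared across blanks, and distractors outnumber blanks.
    "candidates": [
        "Books introduce words rarely heard in conversation.",   # answer, blank 1
        "Over time, this makes complex texts easier to parse.",  # answer, blank 2
        "Swimming is excellent aerobic exercise.",               # distractor
    ],
    "answers": [0, 1],  # gold candidate index per blank
}

def solve(instance, score_fn):
    """Assign one distinct candidate per blank by scoring whole assignments.
    Because blanks share the candidate pool, they must be solved jointly:
    choosing a candidate for one blank removes it from the others."""
    n_blanks = len(instance["answers"])
    cand_idx = range(len(instance["candidates"]))
    best = max(permutations(cand_idx, n_blanks),
               key=lambda assign: score_fn(instance, assign))
    return list(best)

def blank_accuracy(pred, gold):
    """Per-blank accuracy over a single instance."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)
```

A real model would implement `score_fn` with a learned passage-candidate scorer; the brute-force permutation search is only feasible here because examination candidate sets are small.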
Related papers
- ProsAudit, a prosodic benchmark for self-supervised speech models [14.198508548718676]
ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning (SSL) speech models.
It consists of two subtasks, their corresponding metrics, and an evaluation dataset.
arXiv Detail & Related papers (2023-02-23T14:30:23Z)
- Findings on Conversation Disentanglement [28.874162427052905]
We build a learning model that learns utterance-to-utterance and utterance-to-thread classification.
Experiments on the Ubuntu IRC dataset show that this approach has the potential to outperform the conventional greedy approach.
arXiv Detail & Related papers (2021-12-10T05:54:48Z)
- Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement [7.288480094345606]
We focus on the level of agreement among annotators while selecting data to create offensive language datasets.
Our study comprises the creation of three novel datasets of English tweets covering different topics.
We show that such hard cases, where low agreement is present, are not necessarily due to poor-quality annotation.
arXiv Detail & Related papers (2021-09-28T08:55:04Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses per candidate. We extract context from these responses and feed it as additional speaker-specific context to our network when scoring a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z)
- Exploiting Unsupervised Data for Emotion Recognition in Conversations [76.01690906995286]
Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations.
The available supervised data for the ERC task is limited.
We propose a novel approach to leverage unsupervised conversation data.
arXiv Detail & Related papers (2020-10-02T13:28:47Z)
- How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation [82.96358326053115]
We investigate sensitivity of probing task results to structural design choices.
We probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as identified for English.
We find that results on English do not transfer to other languages.
arXiv Detail & Related papers (2020-06-16T12:37:50Z)
- Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models [29.40992909208733]
We propose CONPONO, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences.
On the discourse representation benchmark DiscoEval, our model improves over the previous state-of-the-art by up to 13%.
We also show that CONPONO yields gains of 2%-6% absolute even for tasks that do not explicitly evaluate discourse.
arXiv Detail & Related papers (2020-05-20T23:21:43Z)
- Toward Better Storylines with Sentence-Level Language Models [54.91921545103256]
We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives.
We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task.
arXiv Detail & Related papers (2020-05-11T16:54:19Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.