My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism
- URL: http://arxiv.org/abs/2012.13872v1
- Date: Sun, 27 Dec 2020 06:19:20 GMT
- Title: My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism
- Authors: Swapnil Parekh, Yaman Kumar Singla, Changyou Chen, Junyi Jessy Li,
Rajiv Ratn Shah
- Abstract summary: Recent work shows that automated scoring systems are prone even to common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
- Score: 71.34160809068996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems over the past two decades. However, little research has been devoted to understanding and interpreting the black-box nature of these deep-learning based scoring models. Recent work shows that automated scoring systems are prone even to common-sense adversarial samples. Their lack of natural language understanding capability raises questions about models that are actively used by millions of candidates for life-changing decisions. Since scoring is a highly multi-modal task, it is imperative that scoring models be validated and tested on all of these modalities. We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms and why they are susceptible to adversarial samples. We find that the systems tested consider essays not as prose with a natural flow of speech and grammatical structure, but as "word-soups" in which a few words matter far more than the rest. Removing the context surrounding those few important words causes the prose to lose its flow of speech and grammar, yet has little impact on the predicted score. We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
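As a rough illustration of the word-importance analysis and the "word-soup" behaviour described in the abstract, here is a minimal sketch. The `score_essay` function is a hypothetical keyword-overlap toy, not the paper's trained AES models, and the leave-one-word-out occlusion probe merely stands in for the interpretability techniques the authors actually use; it is only meant to make the idea runnable end to end.

```python
def score_essay(text: str) -> float:
    """Toy stand-in scorer (assumption): rewards density of prompt-related keywords."""
    keywords = {"computers", "technology", "society", "benefits", "learning"}
    words = text.lower().split()
    return 10.0 * sum(w.strip(".,") in keywords for w in words) / max(len(words), 1)


def word_importance(text: str):
    """Leave-one-word-out occlusion: importance = score drop when that word is removed."""
    base = score_essay(text)
    words = text.split()
    drops = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        drops.append((w, base - score_essay(reduced)))
    return sorted(drops, key=lambda x: x[1], reverse=True)


essay = ("Computers and technology bring many benefits to society, "
         "and they support learning for students everywhere.")

ranked = word_importance(essay)
top_words = {w for w, _ in ranked[:5]}

# "Word-soup" test: keep only the highest-importance words, discarding the
# grammatical context around them, then compare scores.
soup = " ".join(w for w in essay.split() if w in top_words)
print("full essay score:", round(score_essay(essay), 2))
print("word-soup score :", round(score_essay(soup), 2))
print("top words       :", sorted(top_words))
```

With this toy scorer the context-free word-soup scores at least as high as the full essay, which mirrors (but does not reproduce) the paper's finding that stripping the context around a model's most important words has little effect on the predicted score.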
Related papers
- DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
- Testing AI on language comprehension tasks reveals insensitivity to underlying meaning [3.335047764053173]
Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education.
Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard.
We systematically assess 7 state-of-the-art models on a novel benchmark.
arXiv Detail & Related papers (2023-02-23T20:18:52Z)
- A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives [0.34998703934432673]
We analyze two Natural Language Inference data sets with respect to their linguistic features.
The goal is to identify those syntactic and semantic properties that are particularly hard to comprehend for a machine learning model.
arXiv Detail & Related papers (2022-10-19T10:06:03Z)
- Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods [6.018950511093273]
Saliency maps can explain a neural model's predictions by identifying important input features.
We formalize the underexplored task of translating saliency maps into natural language.
We compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations.
arXiv Detail & Related papers (2022-10-13T17:48:15Z)
- Explainable Verbal Deception Detection using Transformers [1.5104201344012347]
This paper proposes and evaluates six deep-learning models, including combinations of BERT (and RoBERTa), MultiHead Attention, co-attentions, and transformers.
The findings suggest that our transformer-based models can enhance automated deception detection performance (+2.11% in accuracy).
arXiv Detail & Related papers (2022-10-06T17:36:00Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.