Trusting Language Models in Education
- URL: http://arxiv.org/abs/2308.03866v1
- Date: Mon, 7 Aug 2023 18:27:54 GMT
- Title: Trusting Language Models in Education
- Authors: Jogi Suda Neto, Li Deng, Thejaswi Raya, Reza Shahbazi, Nick Liu,
Adhitya Venkatesh, Miral Shah, Neeru Khosla, Rodrigo Capobianco Guido
- Abstract summary: We propose using an XGBoost model on top of BERT to output corrected probabilities.
Our hypothesis is that the level of uncertainty contained in the flow of attention is related to the quality of the model's response itself.
- Score: 1.2578554943276923
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language Models are being widely used in Education. Even though modern deep
learning models achieve very good performance on question-answering tasks, they
sometimes make errors. To avoid misleading students by showing wrong answers, it is
important to calibrate the confidence - that is, the prediction probability - of
these models. In our work, we propose using an XGBoost model on top of BERT to
output corrected probabilities, using features based on the attention mechanism.
Our hypothesis is that the level of uncertainty contained in the flow of attention
is related to the quality of the model's response itself.
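
The abstract describes calibrating a BERT-based question-answering model by training an XGBoost classifier on features derived from the attention mechanism. Since the exact features are not spelled out here, the snippet below is only a minimal sketch under stated assumptions: it summarizes per-layer attention entropy plus the softmax confidence and fits XGBoost to predict whether the base model answered correctly, using that probability as the corrected confidence. The model name, feature set, and hyperparameters are illustrative assumptions, not the authors' recipe.

```python
# Hypothetical sketch: attention-based uncertainty features from BERT feeding
# an XGBoost calibrator. In practice the classification head would be fine-tuned.
import numpy as np
import torch
import xgboost as xgb
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def attention_features(text):
    """Summarize the attention flow of one input as a small feature vector:
    mean attention entropy per layer plus the model's softmax confidence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    feats = []
    for layer in out.attentions:              # each: (1, heads, seq, seq)
        probs = layer[0]
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        feats.append(entropy.mean().item())   # average over heads and query positions
    feats.append(out.logits.softmax(-1).max().item())
    return np.array(feats)

def fit_calibrator(texts, was_correct):
    """Fit XGBoost on held-out examples; was_correct[i] is 1 if the base model's
    answer to texts[i] was right, else 0. The calibrated confidence is then
    booster.predict_proba(features)[:, 1]."""
    X = np.stack([attention_features(t) for t in texts])
    booster = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    booster.fit(X, np.asarray(was_correct))
    return booster
```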
Related papers
- Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
- Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
- Beyond Confidence: Reliable Models Should Also Consider Atypicality [43.012818086415514]
We investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions.
We show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy.
We propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance.
arXiv Detail & Related papers (2023-05-29T17:37:09Z)
- Do Not Trust a Model Because It is Confident: Uncovering and Characterizing Unknown Unknowns to Student Success Predictors in Online-Based Learning [10.120425915106727]
Student success models might be prone to develop weak spots, i.e., examples that are hard to classify accurately.
This weakness is one of the main factors undermining users' trust, since model predictions could, for instance, lead an instructor not to intervene for a student in need.
In this paper, we unveil the need of detecting and characterizing unknown unknowns in student success prediction.
arXiv Detail & Related papers (2022-12-16T15:32:49Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We present a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
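
Several entries above, in particular the last one, hinge on whether a model's confidence scores correlate with the likelihood of correctness. As a generic illustration of how calibration is commonly measured and adjusted (not the method of any paper listed here), the sketch below computes expected calibration error and fits a single softmax temperature by grid search; the bin count and temperature grid are arbitrary choices.

```python
# Generic calibration utilities: expected calibration error (ECE) and
# single-temperature scaling by grid search. Illustrative, not from any paper above.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence| per bin,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T minimizing negative log-likelihood on held-out data;
    calibrated probabilities are softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)   # shape (n, num_classes)
    labels = np.asarray(labels, dtype=int)     # shape (n,)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        z = logits / T
        z -= z.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```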
This list is automatically generated from the titles and abstracts of the papers on this site.