Language Model Classifier Aligns Better with Physician Word Sensitivity
than XGBoost on Readmission Prediction
- URL: http://arxiv.org/abs/2211.07047v2
- Date: Tue, 15 Nov 2022 20:08:01 GMT
- Title: Language Model Classifier Aligns Better with Physician Word Sensitivity
than XGBoost on Readmission Prediction
- Authors: Grace Yang, Ming Cao, Lavender Y. Jiang, Xujin C. Liu, Alexander T.M.
Cheung, Hannah Weiss, David Kurland, Kyunghyun Cho, Eric K. Oermann
- Abstract summary: We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
- Score: 86.15787587540132
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional evaluation metrics for classification in natural language
processing such as accuracy and area under the curve fail to differentiate
between models with different predictive behaviors despite their similar
performance metrics. We introduce sensitivity score, a metric that scrutinizes
models' behaviors at the vocabulary level to provide insights into disparities
in their decision-making logic. We assess the sensitivity score on a set of
representative words in the test set using two classifiers trained for hospital
readmission classification with similar performance statistics. Our experiments
compare the decision-making logic of clinicians and classifiers based on rank
correlations of sensitivity scores. The results indicate that the language
model's sensitivity scores align better with the professionals' than those of
the XGBoost classifier on TF-IDF embeddings, which suggests that XGBoost relies on spurious
features. Overall, this metric offers a novel perspective on assessing models'
robustness by quantifying their discrepancy with professional opinions. Our
code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).
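The authors' exact implementation is in the repository above; as a minimal illustrative sketch (not the paper's API), a word's sensitivity can be estimated by masking it and measuring the shift in predicted readmission probability, with physician alignment then read off a Spearman rank correlation. The `predict_proba` interface and the averaging over notes below are assumptions.

```python
# Sketch of a vocabulary-level sensitivity score and its rank alignment with
# physician ratings. Illustrative only; see the authors' repo for the real one.
from scipy.stats import spearmanr

def sensitivity_score(model, text: str, word: str) -> float:
    """Shift in predicted positive-class probability when `word` is masked out.

    `model` is assumed to expose predict_proba([text]) -> [[p0, p1]], as
    scikit-learn pipelines do.
    """
    masked = " ".join(tok for tok in text.split() if tok.lower() != word.lower())
    p_orig = model.predict_proba([text])[0][1]
    p_masked = model.predict_proba([masked])[0][1]
    return abs(p_orig - p_masked)

def physician_alignment(model, notes, probe_words, physician_ratings) -> float:
    """Spearman correlation between model word sensitivities and physician ratings."""
    model_scores = [
        sum(sensitivity_score(model, note, w) for note in notes) / len(notes)
        for w in probe_words
    ]
    rho, _ = spearmanr(model_scores, physician_ratings)
    return rho
```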
Related papers
- Knowledge Trees: Gradient Boosting Decision Trees on Knowledge Neurons
as Probing Classifier [0.0]
Logistic regression on the output representations of a transformer layer is the most common choice for probing the syntactic properties of a language model.
We show that using gradient boosting decision trees at the Knowledge Neuron layer is more advantageous than using logistic regression on the output representations of the transformer layer.
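A rough sketch of such a probing comparison, with random features standing in for the Knowledge Neuron activations the paper extracts from a real language model (the synthetic data and labels below are assumptions for illustration):

```python
# Probing sketch: compare a linear probe against gradient-boosted trees on
# frozen representations. Random features stand in for real activations.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))             # stand-in for layer activations
y = (X[:, :4].sum(axis=1) > 0).astype(int)  # stand-in syntactic-property label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, probe in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier()),
]:
    probe.fit(X_tr, y_tr)
    print(f"{name}: probe accuracy = {probe.score(X_te, y_te):.3f}")
```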
arXiv Detail & Related papers (2023-12-17T15:37:03Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Understanding and Mitigating Spurious Correlations in Text
Classification with Neighborhood Analysis [69.07674653828565]
Machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances.
In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis.
We propose a family of regularization methods, NFL (doN't Forget your Language), to mitigate spurious correlations in text classification.
arXiv Detail & Related papers (2023-05-23T03:55:50Z)
- DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation [24.224114300690758]
We propose DEnsity, which evaluates a response using density estimation on the feature space derived from a neural classifier.
Our metric measures how likely a response would appear in the distribution of human conversations.
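A minimal sketch of the idea, with a Gaussian kernel density estimator as an illustrative stand-in for the paper's estimator and random vectors standing in for the neural classifier's features:

```python
# Density-based response scoring: fit a density model on features of human
# responses, then score a candidate by its log-likelihood under that model.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
human_features = rng.normal(size=(500, 8))   # stand-in classifier features

# gaussian_kde expects data of shape (n_dims, n_samples).
kde = gaussian_kde(human_features.T)

def density_score(response_features: np.ndarray) -> float:
    """Higher means the response looks more like the human conversations."""
    return kde.logpdf(response_features.reshape(-1, 1))[0]

print(density_score(rng.normal(size=8)))
```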
arXiv Detail & Related papers (2023-05-08T14:10:40Z)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human
Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z)
- Perturbations and Subpopulations for Testing Robustness in Token-Based
Argument Unit Recognition [6.502694770864571]
Argument Unit Recognition and Classification aims at identifying argument units in text and classifying them as pro or con.
One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences.
Previous research suggests that fine-tuning language models at the token level yields more robust sentence classification than training on sentences directly.
We reproduce the study that originally made this claim and further investigate what exactly token-based systems learned better compared to sentence-based ones.
arXiv Detail & Related papers (2022-09-29T13:44:28Z)
- Rethinking and Refining the Distinct Metric [61.213465863627476]
We refine the calculation of distinct scores by re-scaling the number of distinct tokens based on its expectation.
We provide both empirical and theoretical evidence to show that our method effectively removes the biases exhibited in the original distinct score.
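As an illustrative sketch, one natural rescaling divides the observed number of distinct tokens by its expectation under uniform sampling from a vocabulary of size V; the paper's exact derivation may differ:

```python
# Expectation-adjusted distinct (sketch). Drawing c tokens uniformly from a
# vocabulary of size v yields v * (1 - ((v - 1) / v) ** c) distinct tokens in
# expectation; dividing by this removes the length bias of the raw score.
def expectation_adjusted_distinct(tokens: list[str], vocab_size: int) -> float:
    c = len(tokens)
    n_distinct = len(set(tokens))
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** c)
    return n_distinct / expected

print(expectation_adjusted_distinct("the cat sat on the mat".split(), 50_000))
```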
arXiv Detail & Related papers (2022-02-28T07:36:30Z)
- More Than Words: Towards Better Quality Interpretations of Text
Classifiers [16.66535643383862]
We show that token-based interpretability, while a convenient first choice given the input interfaces of ML models, is not the most effective in all situations.
We show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by the randomization tests, 2) they lead to lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans in situations where the linguistic coherence resides at a higher level.
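As a toy illustration of one such higher-level attribution (the sentence grouping and the attribution values below are made-up assumptions), token-level scores can be aggregated by summing within each sentence:

```python
# Aggregate token-level attributions (e.g., SHAP values) into sentence-level
# attributions by summing within each sentence span. Values are made up.
def sentence_attributions(token_attrs, sentence_ids):
    """sentence_ids[i] is the index of the sentence containing token i."""
    totals: dict[int, float] = {}
    for attr, sid in zip(token_attrs, sentence_ids):
        totals[sid] = totals.get(sid, 0.0) + attr
    return totals

token_attrs = [0.30, 0.10, 0.00, -0.45, -0.20, 0.00]
sentence_ids = [0, 0, 0, 1, 1, 1]
print(sentence_attributions(token_attrs, sentence_ids))
# -> {0: 0.4, 1: -0.65} (up to float rounding)
```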
arXiv Detail & Related papers (2021-12-23T10:18:50Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)