Automatic Pharma News Categorization
- URL: http://arxiv.org/abs/2201.00688v1
- Date: Tue, 28 Dec 2021 08:42:16 GMT
- Title: Automatic Pharma News Categorization
- Authors: Stanislaw Adaszewski, Pascal Kuner, Ralf J. Jaeger
- Abstract summary: We use a text dataset consisting of 23 news categories relevant to pharma information science.
We compare the fine-tuning performance of multiple transformer models in a classification task.
We propose an ensemble model consisting of the top performing individual predictors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We use a text dataset consisting of 23 news categories relevant to pharma
information science, in order to compare the fine-tuning performance of
multiple transformer models in a classification task. Using a well-balanced
dataset with multiple autoregressive and autoencoding transformer models, we
compare their fine-tuning performance. To validate the winning approach, we
perform diagnostics of model behavior on mispredicted instances, including
inspection of category-wise metrics, evaluation of prediction certainty and
assessment of latent space representations. Lastly, we propose an ensemble
model consisting of the top performing individual predictors and demonstrate
that this approach offers a modest improvement in the F1 metric.
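The abstract does not spell out how the ensemble combines the individual predictors. A minimal sketch of one common approach, soft voting (average each fine-tuned model's softmax probabilities, then take the argmax), assuming per-model class logits are available as arrays; the logit values below are hypothetical:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(model_logits: list[np.ndarray]) -> np.ndarray:
    """Soft-voting ensemble: average class probabilities across models,
    then pick the highest-probability category per example."""
    probs = np.mean([softmax(l) for l in model_logits], axis=0)
    return probs.argmax(axis=-1)

# Hypothetical logits from two fine-tuned classifiers (2 examples, 3 classes).
logits_a = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
logits_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
print(ensemble_predict([logits_a, logits_b]))  # → [0 1]
```

Soft voting is only one option; majority voting over the models' argmax predictions is an equally plausible reading of "ensemble of top performing predictors".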
Related papers
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes from balancing intra- and inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z)
- Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts [14.201816626446885]
We present a novel approach incorporating transformer-based language models into infectious disease modelling.
We benchmark these clustered embedding features against features extracted from other high-quality datasets.
In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals.
arXiv Detail & Related papers (2022-05-20T18:59:04Z)
- Using Explainable Boosting Machine to Compare Idiographic and Nomothetic Approaches for Ecological Momentary Assessment Data [2.0824228840987447]
This paper explores the use of non-linear interpretable machine learning (ML) models in classification problems.
Various ensembles of trees are compared to linear models using imbalanced synthetic and real-world datasets.
In one of the two real-world datasets, knowledge distillation method achieves improved AUC scores.
arXiv Detail & Related papers (2022-04-04T17:56:37Z)
- Vehicle Behavior Prediction and Generalization Using Imbalanced Learning Techniques [1.3381749415517017]
This paper proposes an interaction-aware prediction model consisting of an LSTM autoencoder and SVM classifier.
Evaluations show that the method improves classification accuracy.
arXiv Detail & Related papers (2021-09-22T11:21:20Z)
- Benchmarking AutoML Frameworks for Disease Prediction Using Medical Claims [7.219529711278771]
We generated a large dataset using historical administrative claims including demographic information and flags for disease codes.
We trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.
All models recorded a low area under the precision-recall curve and failed to identify true positives while maintaining a high true negative rate.
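Area under the precision-recall curve is the telling metric here because disease outcomes are rare: accuracy and true negative rate can look strong while nearly every positive is missed. A minimal sketch of average precision, a common step-wise approximation of PR-AUC, on hypothetical labels and scores:

```python
def average_precision(y_true: list[int], scores: list[float]) -> float:
    """Average precision: precision values weighted by recall increments,
    sweeping a descending-score threshold (ties broken arbitrarily)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = fp = 0
    ap = prev_recall = 0.0
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Hypothetical imbalanced example: 2 positives among 4 cases.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # ≈ 0.833
```

Unlike ROC-AUC, this score has no credit term for true negatives, which is why it exposes the failure mode the summary describes.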
arXiv Detail & Related papers (2021-07-22T07:34:48Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
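In the setup summarized above, a document's relevance score is typically derived from the decoder's logits for the "true" and "false" target words at the first generation step: the softmax probability of "true" becomes the ranking score. A minimal, model-free sketch of that scoring step, assuming the two logits have already been extracted from a seq2seq model (the logit values below are hypothetical):

```python
import math

def relevance_score(logit_true: float, logit_false: float) -> float:
    """Softmax probability of the "true" target word over the
    {"true", "false"} pair, max-shifted for numerical stability."""
    m = max(logit_true, logit_false)
    p_true = math.exp(logit_true - m)
    p_false = math.exp(logit_false - m)
    return p_true / (p_true + p_false)

# Hypothetical decoder logits: the model favors "true" for a relevant document.
print(round(relevance_score(2.0, 0.0), 3))  # → 0.881
```

Documents are then sorted by this probability to produce the final ranking.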
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.