Cascade Neural Ensemble for Identifying Scientifically Sound Articles
- URL: http://arxiv.org/abs/2004.06222v1
- Date: Mon, 13 Apr 2020 22:23:04 GMT
- Title: Cascade Neural Ensemble for Identifying Scientifically Sound Articles
- Authors: Ashwin Karthik Ambalavanan, Murthy Devarakonda
- Abstract summary: A barrier to conducting systematic reviews and meta-analysis is efficiently finding scientifically sound relevant articles.
We trained and tested several ensemble architectures of SciBERT on a dataset of about 50K articles from MEDLINE.
The cascade ensemble architecture achieved 0.7505 F measure, an impressive 49.1% error rate reduction.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: A significant barrier to conducting systematic reviews and
meta-analysis is efficiently finding scientifically sound relevant articles.
Typically, less than 1% of articles match this requirement which leads to a
highly imbalanced task. Although feature-engineered and early neural networks
models were studied for this task, there is an opportunity to improve the
results.
Methods: We framed the problem of filtering articles as a classification
task, and trained and tested several ensemble architectures of SciBERT, a
variant of BERT pre-trained on scientific articles, on a manually annotated
dataset of about 50K articles from MEDLINE. Since scientifically sound articles
are identified through a multi-step process we proposed a novel cascade
ensemble analogous to the selection process. We compared the performance of the
cascade ensemble with a single integrated model and other types of ensembles as
well as with results from previous studies.
Results: The cascade ensemble architecture achieved 0.7505 F measure, an
impressive 49.1% error rate reduction, compared to a CNN model that was
previously proposed and evaluated on a selected subset of the 50K articles. On
the full dataset, the cascade ensemble achieved 0.7639 F measure, resulting in
an error rate reduction of 19.7% compared to the best performance reported in a
previous study that used the full dataset.
Conclusion: Pre-trained contextual encoder neural networks (e.g. SciBERT)
perform better than the models studied previously and manually created search
filters in filtering for scientifically sound relevant articles. The superior
performance achieved by the cascade ensemble is a significant result that
generalizes beyond this task and the dataset, and is analogous to query
optimization in IR and databases.
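The cascade idea can be illustrated with a minimal sketch: an article is accepted only if every stage classifier in sequence accepts it, mirroring the multi-step selection process the abstract describes. The stage names, thresholds, and toy classifiers below are hypothetical stand-ins; the paper's actual stages are fine-tuned SciBERT models, which are not reproduced here.

```python
class CascadeEnsemble:
    """Pass an article through a sequence of binary stage classifiers.

    An article is labeled positive (scientifically sound) only if every
    stage accepts it, analogous to a multi-step selection process.
    """

    def __init__(self, stages):
        # stages: list of callables mapping an article -> score in [0, 1]
        self.stages = stages

    def predict(self, article, threshold=0.5):
        for stage in self.stages:
            if stage(article) < threshold:
                return 0  # rejected at this stage; later stages are skipped
        return 1  # accepted by all stages


# Toy stage classifiers standing in for fine-tuned SciBERT models
# (hypothetical criteria, for illustration only).
is_original_research = lambda a: 1.0 if a.get("original") else 0.0
is_methodologically_sound = lambda a: 1.0 if a.get("sound") else 0.0

cascade = CascadeEnsemble([is_original_research, is_methodologically_sound])
print(cascade.predict({"original": True, "sound": True}))   # -> 1
print(cascade.predict({"original": True, "sound": False}))  # -> 0
```

One practical property of this design is that cheap or high-precision stages can run first, so most negative articles are filtered out before the more expensive stages execute, similar to query optimization in IR and databases.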
Related papers
- Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes. Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z) - Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection [65.96556073745197]
DiverSified File selection algorithm (DiSF) is proposed to select the most decorrelated text files in the feature space.
DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget.
arXiv Detail & Related papers (2025-04-29T11:13:18Z) - CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting.
We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z) - Comparative Analysis and Ensemble Enhancement of Leading CNN Architectures for Breast Cancer Classification [0.0]
This study introduces a novel and accurate approach to breast cancer classification using histopathology images.
It systematically compares leading Convolutional Neural Network (CNN) models across varying image datasets.
Our findings establish the settings required to achieve exceptional classification accuracy for standalone CNN models.
arXiv Detail & Related papers (2024-10-04T11:31:43Z) - From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [19.070305201045954]
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation.
We observe that only a subset of the data in constructed datasets plays a decisive role.
We introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA learning strategy for light fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z) - Personalized Decentralized Multi-Task Learning Over Dynamic Communication Graphs [59.96266198512243]
We propose a decentralized and federated learning algorithm for tasks that are positively and negatively correlated.
Our algorithm uses gradients to calculate the correlations among tasks automatically, and dynamically adjusts the communication graph to connect mutually beneficial tasks and isolate those that may negatively impact each other.
We conduct experiments on a synthetic Gaussian dataset and a large-scale celebrity attributes (CelebA) dataset.
arXiv Detail & Related papers (2022-12-21T18:58:24Z) - Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC)
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z) - Learning brain MRI quality control: a multi-factorial generalization problem [0.0]
This work aimed at evaluating the performances of the MRIQC pipeline on various large-scale datasets.
We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them.
We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data.
arXiv Detail & Related papers (2022-05-31T15:46:44Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z) - CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z) - Novel Human-Object Interaction Detection via Adversarial Domain Generalization [103.55143362926388]
We study the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.
The challenge mainly stems from the large compositional space of objects and predicates, which leads to the lack of sufficient training data for all the object-predicate combinations.
We propose a unified framework of adversarial domain generalization to learn object-invariant features for predicate prediction.
arXiv Detail & Related papers (2020-05-22T22:02:56Z) - Large-scale empirical validation of Bayesian Network structure learning algorithms with noisy data [9.04391541965756]
This paper investigates the performance of 15 structure learning algorithms.
Each algorithm is tested over multiple case studies, sample sizes, types of noise, and assessed with multiple evaluation criteria.
Results suggest traditional synthetic performance may overestimate real-world performance by anywhere between 10% and more than 50%.
arXiv Detail & Related papers (2020-05-18T18:40:09Z) - Question Type Classification Methods Comparison [0.0]
The paper presents a comparative study of state-of-the-art approaches for the question classification task: Logistic Regression, Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), and Quasi-Recurrent Neural Networks (QRNN).
All models use pre-trained GloVe word embeddings and are trained on human-labeled data.
The best accuracy is achieved using CNN model with five convolutional layers and various kernel sizes stacked in parallel, followed by one fully connected layer.
arXiv Detail & Related papers (2020-01-03T00:16:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information (including all listed papers) and is not responsible for any consequences of its use.