Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024
- URL: http://arxiv.org/abs/2506.09381v1
- Date: Wed, 11 Jun 2025 04:05:57 GMT
- Title: Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024
- Authors: Austin McCutcheon, Thiago E. A. de Oliveira, Aleksandr Zheleznov, Chris Brogly
- Abstract summary: The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links.
We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and training time.
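As a rough illustration of the feature-based approach, a few simple linguistic features of the kind the paper extracts from headlines can be computed in plain Python. The abstract does not enumerate the 115 actual features, so the feature names and choices below are assumptions for illustration only:

```python
# Illustrative sketch (not the paper's code): compute a handful of
# surface-level linguistic features from a headline string, standing in
# for the 115 features described in the abstract.

def headline_features(text: str) -> dict:
    words = text.split()
    n_words = len(words)
    return {
        # number of whitespace-separated tokens
        "n_words": n_words,
        # mean token length in characters
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # fraction of uppercase characters (shouting headlines score high)
        "upper_ratio": sum(c.isupper() for c in text) / len(text) if text else 0.0,
        # count of exclamation marks
        "n_exclaims": text.count("!"),
    }

feats = headline_features("You WON'T Believe What Happened Next!")
```

Feature vectors like this, one per headline, are what a traditional ensemble such as a bagging classifier would consume; the deep-learning route (DistilBERT) instead operates on the raw text directly.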
Related papers
- A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News [1.8737506366172099]
This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis.
We explored a dataset called BAN-ABSA of 9,014 news headlines, which is experimented with here for the first time for simultaneous headline and sentiment categorization.
The proposed BERT-CNN-BiLSTM model significantly outperforms all baseline models in classification tasks.
arXiv Detail & Related papers (2025-11-23T21:22:56Z)
- Classification of worldwide news articles by perceived quality, 2018-2024 [0.0]
3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles.
Expert consensus ratings on 579 source websites were split at the median, creating perceived low- and high-quality classes of about 706,000 articles each.
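The median-split labeling scheme described above can be sketched in a few lines of stdlib Python. The domain names and scores below are invented for illustration (the study used expert ratings on 579 real source websites):

```python
from statistics import median

# Hypothetical expert-consensus quality scores per news domain.
scores = {"siteA": 2.1, "siteB": 3.7, "siteC": 4.5, "siteD": 1.2, "siteE": 3.9}

med = median(scores.values())

# Domains strictly above the median get the high-quality label (1),
# the rest the low-quality label (0); ties at the median need a convention.
labels = {domain: int(score > med) for domain, score in scores.items()}
```

Splitting at the median is what yields the roughly balanced classes the summary mentions, since about half the domains fall on each side.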
arXiv Detail & Related papers (2025-11-20T14:41:41Z)
- A Regularized LSTM Method for Detecting Fake News Articles [0.0]
This paper develops an advanced machine learning solution for detecting fake news articles.
We leverage a comprehensive dataset of news articles, including 23,502 fake news articles and 21,417 accurate news articles.
Our work highlights the potential for deploying such models in real-world applications.
arXiv Detail & Related papers (2024-11-16T05:54:36Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis that cross-lingual evidence is a useful feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework in which the noise level of the data used for finetuning decreases over the stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- Multi-channel CNN to classify nepali covid-19 related tweets using hybrid features [1.713291434132985]
We represent each tweet by combining both syntactic and semantic information, called hybrid features.
We design a novel multi-channel convolutional neural network (MCNN), which ensembles the multiple CNNs.
We evaluate the efficacy of both the proposed feature extraction method and the MCNN model for classifying tweets on the NepCOV19Tweets dataset.
arXiv Detail & Related papers (2022-03-19T09:55:05Z)
- Classification Of Fake News Headline Based On Neural Networks [0.0]
In this article, we use a dataset provided by the Kaggle platform, containing news from an eighteen-year period, to classify news headlines.
We choose TF-IDF to extract features and a neural network as the classifier, with accuracy as the evaluation metric.
Our NN model achieves an accuracy of 0.8622, the highest among the four models.
arXiv Detail & Related papers (2022-01-24T21:37:39Z)
- Transforming Fake News: Robust Generalisable News Classification Using Transformers [8.147652597876862]
Using the publicly available ISOT and Combined Corpus datasets, this study explores transformers' abilities to identify fake news.
We propose a novel two-step classification pipeline to remove such articles from both model training and the final deployed inference system.
Experiments over the ISOT and Combined Corpus datasets show that transformers achieve an increase in F1 scores of up to 4.9% for out-of-distribution generalisation.
arXiv Detail & Related papers (2021-09-20T19:03:16Z)
- Transformer-based Language Model Fine-tuning Methods for COVID-19 Fake News Detection [7.29381091750894]
We propose a novel transformer-based language model fine-tuning approach for this fake news detection task.
First, the token vocabulary of each individual model is expanded to cover the semantics of professional phrases.
Last, the predicted features extracted by the universal language model RoBERTa and the domain-specific model CT-BERT are fused by a multilayer perceptron to integrate fine-grained and high-level specific representations.
arXiv Detail & Related papers (2021-01-14T09:05:42Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
- A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers [54.996358399108566]
We investigate the performance of landmark general CNN classifiers, which have presented top-notch results on large-scale classification datasets.
We compare them against state-of-the-art fine-grained classifiers.
We present an extensive evaluation on six datasets to determine whether the fine-grained classifiers are able to elevate the baseline in their experiments.
arXiv Detail & Related papers (2020-03-24T23:49:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.