Construction of Large-Scale Misinformation Labeled Datasets from Social
Media Discourse using Label Refinement
- URL: http://arxiv.org/abs/2202.12413v1
- Date: Thu, 24 Feb 2022 23:10:29 GMT
- Title: Construction of Large-Scale Misinformation Labeled Datasets from Social
Media Discourse using Label Refinement
- Authors: Karishma Sharma, Emilio Ferrara, Yan Liu
- Abstract summary: We propose to leverage news-source credibility labels as weak labels for social media posts.
The framework will incorporate social context of the post in terms of the community of its associated user for surfacing inaccurate labels.
The approach is demonstrated for providing a large-scale misinformation dataset on COVID-19 vaccines.
- Score: 16.754951815543006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malicious accounts spreading misinformation have led to widespread false and
misleading narratives in recent times, especially during the COVID-19 pandemic,
and social media platforms struggle to eliminate this content rapidly. This
is because adapting to new domains requires human-intensive fact-checking that
is slow and difficult to scale. To address this challenge, we propose to
leverage news-source credibility labels as weak labels for social media posts
and propose model-guided refinement of labels to construct large-scale, diverse
misinformation labeled datasets in new domains. The weak labels can be
inaccurate at the article or social media post level where the stance of the
user does not align with the news source or article credibility. We propose a
framework to use a detection model self-trained on the initial weak labels with
uncertainty sampling based on entropy in predictions of the model to identify
potentially inaccurate labels and correct for them using self-supervision or
relabeling. The framework will incorporate social context of the post in terms
of the community of its associated user for surfacing inaccurate labels towards
building a large-scale dataset with minimum human effort. To provide labeled
datasets with distinction of misleading narratives where information might be
missing significant context or has inaccurate ancillary details, the proposed
framework will use the few labeled samples as class prototypes to separate high
confidence samples into false, unproven, mixture, mostly false, mostly true,
true, and debunk information. The approach is demonstrated for providing a
large-scale misinformation dataset on COVID-19 vaccines.
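The two mechanisms named in the abstract, entropy-based uncertainty sampling and class prototypes, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the detection model, the number of samples flagged, the embedding space, or how prototypes are built, so the function names, the `top_k` parameter, the toy probabilities, and the toy prototype vectors below are all hypothetical.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_uncertain(pred_probs, top_k):
    """Return indices of the top_k posts with the highest predictive entropy.

    High entropy means the model is unsure, so the weak label on that post
    is a candidate for relabeling (assumed selection rule, for illustration).
    """
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: prediction_entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:top_k]

def nearest_prototype(embedding, prototypes):
    """Assign the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(embedding, prototypes[label]))

# Hypothetical model outputs: [P(reliable), P(unreliable)] for four posts.
pred_probs = [[0.98, 0.02], [0.55, 0.45], [0.10, 0.90], [0.50, 0.50]]
uncertain = flag_uncertain(pred_probs, top_k=2)

# Hypothetical 2-d prototypes for three of the fine-grained classes.
prototypes = {"false": [0.0, 1.0], "true": [1.0, 0.0], "mixture": [0.5, 0.5]}
label = nearest_prototype([0.9, 0.2], prototypes)
```

In this toy run the two posts with near-uniform predictions are the ones surfaced for review, while a confidently embedded sample is assigned to the closest class prototype. A real pipeline would replace the toy lists with model outputs and learned embeddings.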
Related papers
- Suicide Risk Assessment on Social Media with Semi-Supervised Learning [20.193174124912282]
We propose a semi-supervised framework that leverages labeled and unlabeled data.
We manually verify a subset of the pseudo-labeled data that was not predicted unanimously across multiple trials of pseudo-label generation.
By leveraging partially validated pseudo-labeled data in addition to ground-truth labeled data, we substantially improve our model's ability to assess suicide risk from social media posts.
arXiv Detail & Related papers (2024-11-18T02:43:05Z)
- Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels [63.16824565919966]
This paper proposes to use confusing samples proactively without label correction.
A Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation.
Our intriguing findings highlight the usage of VC learning in dense vision tasks.
arXiv Detail & Related papers (2023-12-02T16:23:52Z) - ScarceNet: Animal Pose Estimation with Scarce Annotations [74.48263583706712]
ScarceNet is a pseudo label-based approach to generate artificial labels for the unlabeled images.
We evaluate our approach on the challenging AP-10K dataset, where our approach outperforms existing semi-supervised approaches by a large margin.
arXiv Detail & Related papers (2023-03-27T09:15:53Z) - Losses over Labels: Weakly Supervised Learning via Direct Loss
Construction [71.11337906077483]
Programmable weak supervision is a growing paradigm within machine learning.
We propose Losses over Labels (LoL), which creates losses directly from heuristics without going through the intermediate step of a label.
We show that LoL improves upon existing weak supervision methods on several benchmark text and image classification tasks.
arXiv Detail & Related papers (2022-12-13T22:29:14Z) - Improved Adaptive Algorithm for Scalable Active Learning with Weak
Labeler [89.27610526884496]
Weak Labeler Active Cover (WL-AC) is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy.
We show its effectiveness on the corrupted-MNIST dataset by significantly reducing the number of labels while keeping the same accuracy as in passive learning.
arXiv Detail & Related papers (2022-11-04T02:52:54Z) - Label Noise-Resistant Mean Teaching for Weakly Supervised Fake News
Detection [93.6222609806278]
We propose a novel label noise-resistant mean teaching approach (LNMT) for weakly supervised fake news detection.
LNMT leverages unlabeled news and feedback comments of users to enlarge the amount of training data.
LNMT establishes a mean teacher framework equipped with label propagation and label reliability estimation.
arXiv Detail & Related papers (2022-06-10T16:01:58Z) - Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z) - Labeled Data Generation with Inexact Supervision [33.110134862501546]
In this paper, we study a novel problem of labeled data generation with inexact supervision.
We propose a novel generative framework named ADDES, which can synthesize high-quality labeled data for target classification tasks.
arXiv Detail & Related papers (2021-06-08T22:22:26Z) - OpinionRank: Extracting Ground Truth Labels from Unreliable Expert
Opinions with Graph-Based Spectral Ranking [2.1930130356902207]
Crowdsourcing has emerged as a popular, inexpensive, and efficient data mining solution for performing distributed label collection.
We propose OpinionRank, a model-free, interpretable, graph-based spectral algorithm for integrating crowdsourced annotations into reliable labels.
Our experiments show that OpinionRank performs favorably when compared against more highly parameterized algorithms.
arXiv Detail & Related papers (2021-02-11T08:12:44Z) - Limitations of weak labels for embedding and tagging [0.0]
Many datasets and approaches in ambient sound analysis use weakly labeled data. Weak labels are employed because annotating every data sample with a strong label is too expensive. Yet, their impact on the performance in comparison to strong labels remains unclear.
In this paper, we formulate a supervised learning problem which involves weak labels. We create a dataset that focuses on the difference between strong and weak labels as opposed to other challenges.
arXiv Detail & Related papers (2020-02-05T08:54:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.