Integrating Crowdsourcing and Active Learning for Classification of
Work-Life Events from Tweets
- URL: http://arxiv.org/abs/2003.12139v2
- Date: Thu, 2 Apr 2020 15:30:35 GMT
- Authors: Yunpeng Zhao, Mattia Prosperi, Tianchen Lyu, Yi Guo, Jiang Bian
- Abstract summary: Social media data are unstructured and must undergo complex manipulation for research use.
We devised a crowdsourcing pipeline combined with active learning strategies.
Results show that crowdsourcing is useful to create high-quality annotations and active learning helps in reducing the number of required tweets.
- Score: 9.137917522951277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media, especially Twitter, is being increasingly used for research
with predictive analytics. In social media studies, natural language processing
(NLP) techniques are used in conjunction with expert-based, manual and
qualitative analyses. However, social media data are unstructured and must
undergo complex manipulation for research use. Manual annotation is the
most resource- and time-consuming step, because multiple expert raters have to
reach consensus on every item, but it is essential for creating gold-standard
datasets for training NLP-based machine learning classifiers. To reduce the
burden of manual annotation while maintaining its reliability, we devised a
crowdsourcing pipeline combined with active learning strategies. We
demonstrated its effectiveness through a case study that identifies job loss
events from individual tweets. We used the Amazon Mechanical Turk platform to
recruit annotators from the Internet and designed a number of quality control
measures to assure annotation accuracy. We evaluated 4 different active
learning strategies (i.e., least confident, entropy, vote entropy, and
Kullback-Leibler divergence). The active learning strategies aim at reducing
the number of tweets needed to reach a desired performance of automated
classification. Results show that crowdsourcing is useful to create
high-quality annotations and active learning helps in reducing the number of
required tweets, although there was no substantial difference among the
strategies tested.
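
The four query strategies named in the abstract can be sketched as scoring functions over model outputs. This is a minimal illustration using the standard textbook definitions, not the authors' implementation: the function names, array shapes, and the `select_batch` helper are assumptions introduced here. Least confident and entropy score a single model's predicted probabilities, while vote entropy and Kullback-Leibler divergence score disagreement within a committee of models.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an illustrative choice, not from the paper


def least_confident(probs):
    """probs: (n_samples, n_classes). Higher score = less confident."""
    return 1.0 - probs.max(axis=1)


def entropy(probs):
    """Shannon entropy of each sample's predicted class distribution."""
    return -(probs * np.log(probs + EPS)).sum(axis=1)


def vote_entropy(votes, n_classes):
    """votes: (n_members, n_samples) hard labels from a committee."""
    # Fraction of committee members voting for each class, per sample.
    frac = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)], axis=1)
    return -(frac * np.log(frac + EPS)).sum(axis=1)


def kl_divergence(committee_probs):
    """committee_probs: (n_members, n_samples, n_classes).

    Mean KL divergence of each member's distribution from the consensus
    (the average distribution over members).
    """
    consensus = committee_probs.mean(axis=0)
    log_ratio = np.log(committee_probs + EPS) - np.log(consensus + EPS)
    return (committee_probs * log_ratio).sum(axis=2).mean(axis=0)


def select_batch(scores, k):
    """Indices of the k highest-scoring (most informative) samples."""
    return np.argsort(scores)[::-1][:k]
```

In an active learning loop, the tweets returned by `select_batch` would be sent for annotation, added to the training set, and the classifier retrained before the next round of scoring.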
Related papers
- KBAlign: Efficient Self Adaptation on Specific Knowledge Bases [75.78948575957081]
Large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner.
We propose KBAlign, an approach designed for efficient adaptation to downstream tasks involving knowledge bases.
Our method utilizes iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently.
arXiv Detail & Related papers (2024-11-22T08:21:03Z)
- Active Learning to Guide Labeling Efforts for Question Difficulty Estimation [1.0514231683620516]
Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods but with an isolated study in unsupervised learning.
This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach.
Experiments demonstrate that active learning with PowerVariance acquisition achieves a performance close to fully supervised models after labeling only 10% of the training data.
arXiv Detail & Related papers (2024-09-14T02:02:42Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Active Learning for Abstractive Text Summarization [50.79416783266641]
We propose the first effective query strategy for Active Learning in abstractive text summarization.
We show that using our strategy in AL annotation helps to improve the model performance in terms of ROUGE and consistency scores.
arXiv Detail & Related papers (2023-01-09T10:33:14Z)
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
- Active Learning of Ordinal Embeddings: A User Study on Football Data [4.856635699699126]
Humans innately measure distance between instances in an unlabeled dataset using an unknown similarity function.
This work uses deep metric learning to learn these user-defined similarity functions from few annotations for a large football trajectory dataset.
arXiv Detail & Related papers (2022-07-26T07:55:23Z)
- Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data [101.6195176510611]
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Mining Implicit Relevance Feedback from User Behavior for Web Question Answering [92.45607094299181]
We conduct the first study to explore the correlation between user behavior and passage relevance.
Our approach significantly improves the accuracy of passage ranking without extra human labeled data.
In practice, this work has proved effective in substantially reducing the human labeling cost for the QA service in a global commercial search engine.
arXiv Detail & Related papers (2020-06-13T07:02:08Z)