Curating corpora with classifiers: A case study of clean energy
sentiment online
- URL: http://arxiv.org/abs/2305.03092v2
- Date: Wed, 10 May 2023 02:39:55 GMT
- Authors: Michael V. Arnold, Peter Sheridan Dodds, Christopher M. Danforth
- Abstract summary: Large-scale corpora of social media posts contain broad public opinion.
Surveys can be expensive to run and lag public opinion by days or weeks.
We propose a method for rapidly selecting the best corpus of relevant documents for analysis.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Well curated, large-scale corpora of social media posts containing broad
public opinion offer an alternative data source to complement traditional
surveys. While surveys are effective at collecting representative samples and
are capable of achieving high accuracy, they can be both expensive to run and
lag public opinion by days or weeks. Both of these drawbacks could be overcome
with a real-time, high volume data stream and fast analysis pipeline. A central
challenge in orchestrating such a data pipeline is devising an effective method
for rapidly selecting the best corpus of relevant documents for analysis.
Querying with keywords alone often includes irrelevant documents that are not
easily disambiguated with bag-of-words natural language processing methods.
Here, we explore methods of corpus curation to filter irrelevant tweets using
pre-trained transformer-based models, fine-tuned for our binary classification
task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95.
The low cost and high performance of fine-tuning such a model suggests that our
approach could be of broad benefit as a pre-processing step for social media
datasets with uncertain corpus boundaries.
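The curation pipeline described in the abstract can be sketched as a two-stage filter: a broad, high-recall keyword query followed by a binary relevance classifier, evaluated with F1 against hand labels. The sketch below is illustrative only: the `is_relevant` rule stands in for the paper's fine-tuned transformer, and the documents, keywords, and context words are invented for the example.

```python
def keyword_query(docs, keywords):
    """Stage 1: high-recall keyword filter (keeps irrelevant matches too)."""
    return [d for d in docs if any(k in d.lower() for k in keywords)]

def is_relevant(doc):
    """Stage 2 stand-in for the fine-tuned transformer classifier.
    A real pipeline would score each candidate with the model instead."""
    context = ("energy", "power", "electricity", "panel", "turbine")
    return any(w in doc.lower() for w in context)

def f1(preds, golds):
    """F1 = 2*TP / (2*TP + FP + FN), as used to evaluate the classifier."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

docs = [
    "Solar panel installations hit a record this year",  # relevant
    "Wind turbines now power half the grid",             # relevant
    "Enjoying the solar eclipse with friends",           # irrelevant
    "Gone with the Wind is a classic film",              # irrelevant
]
golds = [True, True, False, False]

candidates = keyword_query(docs, ["solar", "wind"])  # all four match
preds = [is_relevant(d) for d in candidates]
print(f1(preds, golds))  # -> 1.0 on this toy set
```

The point of the second stage is visible even in this toy: the keyword query alone admits the eclipse and film tweets, which bag-of-words matching cannot disambiguate, while the classifier removes them.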
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework for Knowledge Graph Link Predictors [4.31947784387967]
In large-scale Knowledge Graphs, the ranking process quickly becomes computationally heavy.
Previous approaches used random sampling of entities to assess the quality of links predicted or suggested by a method.
We show that this approach has serious limitations since the ranking metrics produced do not properly reflect true outcomes.
We propose a framework that uses relational recommenders to guide the selection of candidates for evaluation.
arXiv Detail & Related papers (2024-01-25T15:44:46Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that unsupervised analysis enables a computer to predict negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach effectively clusters the documents into two distinct groups, which greatly facilitates the choice of the perplexity threshold.
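The thresholding idea in this entry can be sketched with a toy model: train a language model only on harmful text, then keep documents whose perplexity under that model exceeds a threshold (i.e. documents that look unlike the harmful training data). Everything below is an illustrative assumption: a unigram model with add-one smoothing stands in for the paper's language model, and the corpora and threshold are invented.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, vocab):
    """Toy unigram LM with add-one smoothing, trained solely on the
    'harmful' corpus (stand-in for the paper's full language model)."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(tokens, model):
    """Per-token perplexity of a document under the model."""
    log_prob = sum(math.log(model[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))

harmful_corpus = "bad bad nasty bad nasty".split()
doc_clean = "sunny friendly picnic garden".split()
doc_harm = "bad nasty bad".split()
vocab = set(harmful_corpus + doc_clean + doc_harm)

lm = train_unigram(harmful_corpus, vocab)

# Documents with high perplexity under the harmful-text LM look unlike
# the harmful training data, so they are the ones kept.
threshold = 5.0
kept = [d for d in (doc_clean, doc_harm) if perplexity(d, lm) > threshold]
```

On this toy data the clean document is perplexing to the harmful-text model and is kept, while the harmful document scores low perplexity and is filtered out.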
arXiv Detail & Related papers (2022-12-20T17:14:45Z)
- A combined approach to the analysis of speech conversations in a contact center domain [2.575030923243061]
We describe an experiment with a speech analytics process for an Italian contact center that deals with call recordings extracted from inbound and outbound flows.
First, we illustrate in detail the development of an in-house speech-to-text solution based on the Kaldi framework.
Then, we evaluate and compare different approaches to the semantic tagging of call transcripts.
Finally, a decision tree inducer, called J48S, is applied to the problem of tagging.
arXiv Detail & Related papers (2022-03-12T10:03:20Z)
- Towards A Reliable Ground-Truth For Biased Language Detection [3.2202224129197745]
Existing methods to detect bias mostly rely on annotated data to train machine learning models.
We evaluate data collection options and compare labels obtained from two popular crowdsourcing platforms.
We conclude that detailed annotator training increases data quality, improving the performance of existing bias detection systems.
arXiv Detail & Related papers (2021-12-14T14:13:05Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question.
We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
arXiv Detail & Related papers (2020-09-18T23:39:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.