Curating corpora with classifiers: A case study of clean energy
sentiment online
- URL: http://arxiv.org/abs/2305.03092v2
- Date: Wed, 10 May 2023 02:39:55 GMT
- Title: Curating corpora with classifiers: A case study of clean energy
sentiment online
- Authors: Michael V. Arnold, Peter Sheridan Dodds, Christopher M. Danforth
- Abstract summary: Large-scale corpora of social media posts contain broad public opinion.
Surveys can be expensive to run and lag public opinion by days or weeks.
We propose a method for rapidly selecting the best corpus of relevant documents for analysis.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Well curated, large-scale corpora of social media posts containing broad
public opinion offer an alternative data source to complement traditional
surveys. While surveys are effective at collecting representative samples and
are capable of achieving high accuracy, they can be both expensive to run and
lag public opinion by days or weeks. Both of these drawbacks could be overcome
with a real-time, high volume data stream and fast analysis pipeline. A central
challenge in orchestrating such a data pipeline is devising an effective method
for rapidly selecting the best corpus of relevant documents for analysis.
Querying with keywords alone often includes irrelevant documents that are not
easily disambiguated with bag-of-words natural language processing methods.
Here, we explore methods of corpus curation to filter irrelevant tweets using
pre-trained transformer-based models, fine-tuned for our binary classification
task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95.
The low cost and high performance of fine-tuning such a model suggests that our
approach could be of broad benefit as a pre-processing step for social media
datasets with uncertain corpus boundaries.
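For context on the reported metric, F1 is the harmonic mean of precision and recall for the binary relevant/irrelevant classifier. A minimal sketch of the computation (the confusion counts below are invented for illustration, not the paper's results):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall for a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for the "relevant tweet" class.
tp, fp, fn = 95, 5, 5
print(round(f1_score(tp, fp, fn), 2))  # 0.95
```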
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
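As an illustration of likelihood-as-a-gauge, one can rank candidate prompts by the length-normalized log-likelihood a language model assigns to the question. The sketch below substitutes a toy add-one-smoothed unigram model for a real LM; the corpus, prompts, and function names are all invented for illustration:

```python
import math
from collections import Counter

def unigram_logprob(text: str, counts: Counter, total: int) -> float:
    """Length-normalized log-likelihood under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    vocab = len(counts)
    lp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return lp / len(tokens)

# Toy "LM" fit on a tiny corpus; a stand-in for a real language model.
corpus = "what is the capital of france the capital is paris".split()
counts, total = Counter(corpus), len(corpus)

prompts = [
    "what is the capital of france",  # in-distribution
    "zebra quantum lasagna",          # off-distribution
]
# Select the prompt the model finds most likely.
best = max(prompts, key=lambda p: unigram_logprob(p, counts, total))
print(best)  # the in-distribution prompt scores higher
```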
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Neural Passage Quality Estimation for Static Pruning [23.662724916799004]
We explore whether neural networks can effectively predict which of a document's passages are unlikely to be relevant to any query submitted to the search engine.
We find that our novel methods for estimating passage quality allow passage corpora to be pruned considerably.
This work sets the stage for developing more advanced neural "learning-what-to-index" methods.
arXiv Detail & Related papers (2024-07-16T20:47:54Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach effectively clusters the documents into two distinct groups, which greatly simplifies the choice of the perplexity threshold.
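A minimal sketch of this perplexity-threshold idea, with a toy add-one-smoothed unigram model standing in for the paper's trained language model (the corpus, documents, and threshold below are invented, and an innocuous spam corpus replaces adult/harmful text): train only on in-domain text, then keep documents whose perplexity exceeds the threshold, i.e. the out-of-domain, clean documents.

```python
import math
from collections import Counter

def perplexity(text: str, counts: Counter, total: int) -> float:
    """Perplexity under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    vocab = len(counts)
    logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-logp / len(tokens))

# Toy LM trained solely on the target domain (spam here, as an innocuous
# stand-in for the adult/harmful text used in the paper).
domain = "spam offer free prize click now free prize".split()
counts, total = Counter(domain), len(domain)

docs = ["free prize click now", "the weather is mild today"]
threshold = 6.0  # hypothetical cutoff; in practice read off the gap between the two groups
kept = [d for d in docs if perplexity(d, counts, total) > threshold]
print(kept)  # keeps only the out-of-domain (clean) document
```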
arXiv Detail & Related papers (2022-12-20T17:14:45Z)
- A combined approach to the analysis of speech conversations in a contact center domain [2.575030923243061]
We describe an experimentation with a speech analytics process for an Italian contact center, that deals with call recordings extracted from inbound or outbound flows.
First, we illustrate in detail the development of an in-house speech-to-text solution, based on Kaldi framework.
Then, we evaluate and compare different approaches to the semantic tagging of call transcripts.
Finally, a decision tree inducer, called J48S, is applied to the problem of tagging.
arXiv Detail & Related papers (2022-03-12T10:03:20Z)
- Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question.
We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
arXiv Detail & Related papers (2020-09-18T23:39:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.