A Tweet-based Dataset for Company-Level Stock Return Prediction
- URL: http://arxiv.org/abs/2006.09723v1
- Date: Wed, 17 Jun 2020 08:55:11 GMT
- Title: A Tweet-based Dataset for Company-Level Stock Return Prediction
- Authors: Karolina Sowinska and Pranava Madhyastha
- Abstract summary: We present a dataset that allows for company-level analysis of tweet-based impact on one-, two-, three-, and seven-day stock returns.
Our dataset consists of 862,231 labelled instances from Twitter in English; we also release a cleaned subset of 85,176 labelled instances to the community.
- Score: 8.606705921815985
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Public opinion influences events, especially those related to stock
market movement, where a subtle hint can influence the local outcome of the
market. In this paper, we present a dataset that allows for company-level
analysis of tweet-based impact on one-, two-, three-, and seven-day stock
returns. Our dataset consists of 862,231 labelled instances from Twitter in
English; we also release a cleaned subset of 85,176 labelled instances to the
community. We also provide baselines using standard machine learning algorithms
and a multi-view learning based approach that makes use of different types of
features. Our dataset, scripts and models are publicly available at:
https://github.com/ImperialNLP/stockreturnpred.
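To make the one-, two-, three-, and seven-day return horizons concrete, here is a minimal sketch of how such forward returns could be computed from a series of closing prices. It is not taken from the paper or its repository: the function name `forward_returns`, the use of simple (non-log) close-to-close returns, and the treatment of each horizon as an offset in price-series rows are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence

# Horizons named in the abstract; interpreting them as row offsets in a
# daily price series is an assumption of this sketch.
HORIZONS = (1, 2, 3, 7)


def forward_returns(
    closes: List[float], index: int, horizons: Sequence[int] = HORIZONS
) -> Dict[int, Optional[float]]:
    """Simple return from closes[index] to closes[index + h] for each horizon h.

    Returns None for a horizon that runs past the end of the series.
    """
    base = closes[index]
    result: Dict[int, Optional[float]] = {}
    for h in horizons:
        target = index + h
        if target < len(closes):
            result[h] = (closes[target] - base) / base
        else:
            result[h] = None
    return result


# Hypothetical closing prices for one company, one value per day.
closes = [100.0, 102.0, 101.0, 105.0, 104.0, 103.0, 106.0, 110.0]

# Returns following a tweet posted on day 0.
returns = forward_returns(closes, 0)
for horizon, value in returns.items():
    print(f"{horizon}-day return: {value}")
```

A tweet would then be paired with the returns of the company it mentions, starting from the day it was posted. How the dataset actually aligns tweets with prices and handles non-trading days is described in the paper and scripts, not here.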
Related papers
- A Simple Baseline for Predicting Events with Auto-Regressive Tabular Transformers [70.20477771578824]
Existing approaches to event prediction include time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance.
We propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective.
Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases.
(2024-10-14T15:59:16Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
(2024-07-20T01:34:13Z)
- OPSD: an Offensive Persian Social media Dataset and its baseline evaluations [2.356562319390226]
This paper introduces two offensive datasets for the Persian language.
The first dataset comprises annotations provided by domain experts, while the second consists of a large collection of unlabeled data obtained through web crawling.
The obtained F1-scores for the three-class and two-class versions of the dataset were 76.9% and 89.9% for XLM-RoBERTa, respectively.
(2024-04-08T14:08:56Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
(2023-07-05T20:16:20Z)
- Cross-Domain Shopping and Stock Trend Analysis [0.0]
This paper presents a cross-domain trend analysis that aims to identify and analyze the relationships between stock prices, stock news on Twitter, and users' behaviors on e-commerce websites.
The analysis is based on three datasets: a US stock dataset, a stock tweets dataset, and an e-commerce behavior dataset.
(2022-12-23T18:21:28Z)
- Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts [15.108940488494587]
We focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7.
The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis.
In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data.
(2022-10-07T19:58:47Z)
- A Novel Dataset for Evaluating and Alleviating Domain Shift for Human Detection in Agricultural Fields [59.035813796601055]
We evaluate the impact of domain shift on human detection models trained on well-known object detection datasets when deployed on data outside the distribution of the training set.
We introduce the OpenDR Humans in Field dataset, collected in the context of agricultural robotics applications, using the Robotti platform.
(2022-09-27T07:04:28Z)
- Fair Group-Shared Representations with Normalizing Flows [68.29997072804537]
We develop a fair representation learning algorithm which is able to map individuals belonging to different groups into a single group.
We show experimentally that our methodology is competitive with other fair representation learning algorithms.
(2022-01-17T10:49:49Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
(2021-05-29T21:05:28Z)
- HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks [5.937482215664902]
Social media content is often too noisy for direct use in any application.
It is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making.
We present a new large-scale dataset with 77K human-labeled tweets, sampled from a pool of 24 million tweets across 19 disaster events.
(2021-04-07T12:29:36Z)
- Sentiment Analysis on Social Media Content [0.0]
The aim of this paper is to present a model that can perform sentiment analysis of real data collected from Twitter.
Twitter data is highly unstructured, which makes it difficult to analyze.
Our proposed model differs from prior work in this field because it combines supervised and unsupervised machine learning algorithms.
(2020-07-04T17:03:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.