CrudeOilNews: An Annotated Crude Oil News Corpus for Event Extraction
- URL: http://arxiv.org/abs/2204.03871v1
- Date: Fri, 8 Apr 2022 06:51:35 GMT
- Title: CrudeOilNews: An Annotated Crude Oil News Corpus for Event Extraction
- Authors: Meisin Lee, Lay-Ki Soon, Eu-Gene Siew, Ly Fie Sugianto
- Abstract summary: CrudeOilNews is a corpus of English Crude Oil news for event extraction.
It is the first of its kind for Commodity News and serves to contribute towards resource building for economic and financial text mining.
- Score: 0.665264113799989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present CrudeOilNews, a corpus of English Crude Oil news
for event extraction. It is the first of its kind for Commodity News and serves
to contribute towards resource building for economic and financial text mining.
This paper describes the data collection process, the annotation methodology
and the event typology used in producing the corpus. First, a seed set of 175
news articles was manually annotated, of which a subset of 25 articles was used
as the adjudicated reference test set for inter-annotator and system
evaluation. Agreement was generally substantial and annotator performance was
adequate, indicating that the annotation scheme produces consistent event
annotations of high quality. Subsequently, the dataset is expanded through (1)
data augmentation and (2) human-in-the-loop active learning. The resulting
corpus has 425 news articles with approximately 11k events annotated. As part
of the active learning process, the corpus was used to train basic event extraction
models for machine labeling; the resulting models also serve as a validation, or
as a pilot study, demonstrating the use of the corpus for machine learning
purposes. The annotated corpus is made available for academic research purposes
at https://github.com/meisin/CrudeOilNews-Corpus.
Related papers
- Fine-Grained Named Entities for Corona News [0.0]
This study proposes a data annotation pipeline to generate training data from corona news articles.
Named entity recognition models are trained on this annotated corpus and then evaluated on test sentences manually annotated by domain experts.
arXiv Detail & Related papers (2024-04-20T18:22:49Z)
- RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts [9.457460355411582]
RAAMove is a comprehensive corpus dedicated to the annotation of move structures in Research Article (RA) abstracts.
The corpus is constructed through two stages: first, expert annotators manually annotate high-quality data; then, based on the human-annotated data, a BERT-based model is employed for automatic annotation.
The result is a large-scale and high-quality corpus comprising 33,988 annotated instances.
arXiv Detail & Related papers (2024-03-23T15:43:30Z)
- CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks [111.13988772503511]
Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
arXiv Detail & Related papers (2024-02-26T17:35:44Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Faking Fake News for Real Fake News Detection: Propaganda-loaded Training Data Generation [105.20743048379387]
We propose a novel framework for generating training examples informed by the known styles and strategies of human-authored propaganda.
Specifically, we perform self-critical sequence training guided by natural language inference to ensure the validity of the generated articles.
Our experimental results show that fake news detectors trained on PropaNews are better at detecting human-written disinformation by 3.62 - 7.69% F1 score on two public datasets.
arXiv Detail & Related papers (2022-03-10T14:24:19Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Effective Use of Graph Convolution Network and Contextual Sub-Tree for Commodity News Event Extraction [1.398696312226463]
This paper proposes an effective use of Graph Convolutional Networks (GCN) with a pruned dependency parse tree, termed contextual sub-tree, for better event extraction in commodity news.
Experimental results show the efficiency of the proposed solution, which outperforms existing methods with F1 scores as high as 0.90.
arXiv Detail & Related papers (2021-09-27T03:57:17Z)
- MIND - Mainstream and Independent News Documents Corpus [0.7347989843033033]
This paper characterizes MIND, a new Portuguese corpus comprised of different types of articles collected from online mainstream and alternative media sources.
The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories.
arXiv Detail & Related papers (2021-08-13T14:00:12Z)
- Cross-context News Corpus for Protest Events related Knowledge Base Construction [0.15393457051344295]
We describe a gold standard corpus of protest events drawn from various local and international sources in English.
This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information.
arXiv Detail & Related papers (2020-08-01T22:20:48Z)
- Leveraging Declarative Knowledge in Text and First-Order Logic for Fine-Grained Propaganda Detection [139.3415751957195]
We study the detection of propagandistic text fragments in news articles.
We introduce an approach to inject declarative knowledge of fine-grained propaganda techniques.
arXiv Detail & Related papers (2020-04-29T13:46:15Z)
- Salience Estimation with Multi-Attention Learning for Abstractive Text Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.