Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks
- URL: http://arxiv.org/abs/2304.01331v1
- Date: Mon, 3 Apr 2023 19:51:00 GMT
- Title: Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks
- Authors: Andrew Halterman, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi, Grace I. Scarborough
- Abstract summary: Event data, or structured records of ``who did what to whom'' that are automatically extracted from text, is an important source of data for scholars of international politics.
This paper describes a ``bag of tricks'' for efficient, custom event data production, drawing on recent advances in natural language processing (NLP).
We describe how these techniques produced the new POLECAT global event dataset that is intended to replace ICEWS.
- Score: 4.06061049778407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event data, or structured records of ``who did what to whom'' that are
automatically extracted from text, is an important source of data for scholars
of international politics. The high cost of developing new event datasets,
especially using automated systems that rely on hand-built dictionaries, means
that most researchers draw on large, pre-existing datasets such as ICEWS rather
than developing tailor-made event datasets optimized for their specific
research question. This paper describes a ``bag of tricks'' for efficient,
custom event data production, drawing on recent advances in natural language
processing (NLP) that allow researchers to rapidly produce customized event
datasets. The paper introduces techniques for training an event category
classifier with active learning, identifying actors and the recipients of
actions in text using large language models and standard machine learning
classifiers and pretrained ``question-answering'' models from NLP, and
resolving mentions of actors to their Wikipedia article to categorize them. We
describe how these techniques produced the new POLECAT global event dataset
that is intended to replace ICEWS, along with examples of how scholars can
quickly produce smaller, custom event datasets. We publish example code and
models to implement our new techniques.
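As a rough illustration of the active-learning idea mentioned in the abstract, the sketch below uses uncertainty sampling: train on a few labeled sentences, then ask a human to label the unlabeled sentence the model is least sure about. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not say which classifier or query strategy the paper uses, so the tiny naive Bayes model, the function names (`train`, `prob_positive`, `most_uncertain`), and the example sentences are all hypothetical stand-ins.

```python
import math
from collections import Counter


def train(labeled):
    """Fit a tiny two-class naive Bayes model on (text, label) pairs, label in {0, 1}.

    Assumption: this toy model stands in for whatever classifier the paper trains.
    """
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in labeled:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def prob_positive(model, text):
    """Return the model's estimate of P(label = 1 | text), with add-one smoothing."""
    counts, docs, vocab = model
    total1 = sum(counts[1].values()) + len(vocab)
    total0 = sum(counts[0].values()) + len(vocab)
    log_odds = math.log((docs[1] + 1) / (docs[0] + 1))
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        p1 = (counts[1][word] + 1) / total1
        p0 = (counts[0][word] + 1) / total0
        log_odds += math.log(p1 / p0)
    return 1 / (1 + math.exp(-log_odds))


def most_uncertain(model, pool, k=1):
    """Uncertainty sampling: pick the k texts whose probability is closest to 0.5."""
    return sorted(pool, key=lambda t: abs(prob_positive(model, t) - 0.5))[:k]


# Hypothetical seed labels: 1 = conflict event, 0 = cooperation event.
labeled = [
    ("police arrested protesters in the capital", 1),
    ("troops attacked the rebel base", 1),
    ("the minister praised the trade agreement", 0),
    ("leaders signed a cooperation treaty", 0),
]
# Hypothetical unlabeled pool.
pool = [
    "rebels attacked the police",
    "the minister signed the agreement",
    "the minister met protesters",
]

model = train(labeled)
to_label = most_uncertain(model, pool, k=1)
print(to_label)
```

In a full loop, the selected sentence would be labeled by an annotator, appended to `labeled`, and the model retrained, so that annotation effort goes to the examples the current model finds hardest.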
Related papers
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources.
Most datasets or shared tasks focus on extracting ADEs from a particular type of text.
Domain generalisation - the ability of a machine learning model to perform well on new, unseen domains (text types) - is under-explored.
We build a benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Unsupervised Neural Stylistic Text Generation using Transfer learning and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z)
- Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP [0.5482532589225552]
In our work, we suggest leveraging pretrained language models for training data acquisition.
We create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED.
arXiv Detail & Related papers (2022-08-30T18:42:55Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context [0.0]
This tutorial demonstrates how to incorporate text data into actuarial classification and regression tasks.
The main focus is on methods employing transformer-based models.
The case studies tackle challenges related to a multi-lingual setting and long input sequences.
arXiv Detail & Related papers (2022-06-04T15:39:30Z)
- Robust Event Classification Using Imperfect Real-world PMU Data [58.26737360525643]
We study robust event classification using imperfect real-world phasor measurement unit (PMU) data.
We develop a novel machine learning framework for training robust event classifiers.
arXiv Detail & Related papers (2021-10-19T17:41:43Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Iterative Data Programming for Expanding Text Classification Corpora [9.152045698511506]
Real-world text classification tasks often require many labeled training examples that are expensive to obtain.
Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data sets quickly.
We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models.
arXiv Detail & Related papers (2020-02-04T17:12:43Z)