Global Contentious Politics Database (GLOCON) Annotation Manuals
- URL: http://arxiv.org/abs/2206.10299v1
- Date: Tue, 17 May 2022 13:16:50 GMT
- Title: Global Contentious Politics Database (GLOCON) Annotation Manuals
- Authors: Fırat Duruşan, Ali Hürriyetoğlu, Erdem Yörük, Osman Mutlu, Çağrı Yoltar, Burak Gürel, Alvaro Comin
- Abstract summary: The GLOCON Gold Standard Corpus (GSC) contains news articles from multiple sources from each focus country.
The articles in the GSC were manually coded by skilled annotators in both classification and extraction tasks.
This document lays out the rules according to which annotators code the news articles.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The database creation utilized automated text processing tools that detect if
a news article contains a protest event, locate protest information within the
article, and extract pieces of information regarding the detected protest
events. The basis of training and testing the automated tools is the GLOCON
Gold Standard Corpus (GSC), which contains news articles from multiple sources
from each focus country. The articles in the GSC were manually coded by skilled
annotators in both classification and extraction tasks with the accuracy
and consistency that automated tool development demands. To ensure this, the
annotation manuals in this document lay out the rules according to which
annotators code the news articles; annotators refer to the manuals at all
times for all annotation tasks and apply the rules they contain. The
content of the annotation manual is built on the general principles and
standards of linguistic annotation laid out in other prominent annotation
manuals such as ACE, CAMEO, and TimeML. These principles, however, have been
heavily adapted to accommodate the social scientific concepts and variables
employed in the EMW project. The manual has been molded
throughout a long trial and error process that accompanied the annotation of
the GSC. It owes much of its current shape to the meticulous work and
invaluable feedback provided by highly specialized teams of annotators, whose
diligence and expertise greatly increased the quality of the corpus.
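The abstract describes a two-stage pipeline: first classify whether an article reports a protest event, then locate the protest-related passages for information extraction. A minimal keyword-based sketch of that pipeline is shown below; it is purely illustrative, and the function names and lexicon are assumptions, not GLOCON's actual trained tools, which are machine-learned classifiers rather than keyword matchers.

```python
import re

# Hypothetical lexicon of protest-related terms (illustrative only).
PROTEST_TERMS = {"protest", "demonstration", "strike", "rally", "riot"}

def contains_protest_event(article: str) -> bool:
    """Document-level classification: does the article mention a protest?"""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", article)}
    return bool(PROTEST_TERMS & tokens)

def locate_protest_sentences(article: str) -> list[str]:
    """Sentence-level step: keep only sentences containing a protest term."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return [s for s in sentences if contains_protest_event(s)]

article = ("Workers held a strike in the capital on Monday. "
           "The stock market closed higher.")
print(contains_protest_event(article))    # True
print(locate_protest_sentences(article))  # ['Workers held a strike in the capital on Monday.']
```

A real system would replace the lexicon lookup with trained document and sentence classifiers, but the control flow (classify, then locate, then extract) follows the structure the abstract describes.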
Related papers
- Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines
Professional societies often publish curriculum guidelines to help programs align their content to international standards. In Computer Science, the primary standard is published by ACM and IEEE and provides detailed guidelines for what should be and could be included in a Computer Science program. It is difficult for program administrators to assess how much of the guidelines is covered by a CS program. We propose using Natural Language Processing techniques to accelerate the process.
arXiv Detail & Related papers (2026-02-03T19:24:18Z)
- ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. We propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-12T17:57:05Z)
- SmartNote: An LLM-Powered, Personalised Release Note Generator That Just Works
Many developers view the process of writing software release notes as a tedious and dreadful task. We propose SmartNote, a novel and widely applicable release note generation approach. It produces high-quality, contextually personalised release notes using LLM technology.
arXiv Detail & Related papers (2025-05-23T14:45:44Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets
We propose a new tool for automatically annotating text using written guidelines without providing training samples. Our results show that the prompt-based approach is comparable with fine-tuned BERT but requires no annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and the elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- Magic Markup: Maintaining Document-External Markup with an LLM
We present a system that re-tags modified programs, enabling rich annotations to automatically follow code as it evolves. Our system achieves an accuracy of 90% on our benchmarks and can replace a document's tags in parallel at a rate of 5 seconds per tag. While there remains significant room for improvement, we find performance reliable enough to justify further exploration of applications.
arXiv Detail & Related papers (2024-03-06T05:40:31Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text. Previous methods use external knowledge as references for text generation to enhance factuality, but they often struggle with knowledge mix-up from irrelevant references. We present DKGen, which divides text generation into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts
The massive collection of user posts across social media platforms is largely untapped for artificial intelligence (AI) use cases. Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding. This study demonstrates that the applied results of unsupervised analysis allow a computer to predict negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- FETA: Towards Specializing Foundation Models for Expert Task Applications
Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high-fidelity data synthesis, and out-of-domain generalization. We show in this paper that FMs still have poor out-of-the-box performance on expert tasks. We propose a first-of-its-kind FETA benchmark built around the task of teaching FMs to understand technical documentation.
arXiv Detail & Related papers (2022-09-08T08:47:57Z)
- SciAnnotate: A Tool for Integrating Weak Labeling Sources for Sequence Labeling
SciAnnotate is a web-based text annotation tool; the name stands for scientific annotation tool. Our tool provides users with multiple user-friendly interfaces for creating weak labels. In this study, we take multi-source weak label denoising as an example and use a Bertifying Conditional Hidden Markov Model to denoise the weak labels generated by our tool.
arXiv Detail & Related papers (2022-08-07T19:18:13Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives. We propose a unifying perspective based on the nature of information change in NLG tasks. We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- Automatic Extraction of Rules Governing Morphological Agreement
We develop an automated framework for extracting a first-pass grammatical specification from raw text. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Cross-context News Corpus for Protest Events related Knowledge Base Construction
We describe a gold standard corpus of protest events that comprises various local and international sources in English. This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information.
arXiv Detail & Related papers (2020-08-01T22:20:48Z)
- GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction
We introduce the Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction. GLEAKE uses single- and multi-word embedding techniques to explore the syntactic and semantic aspects of candidate phrases. It refines the most significant phrases into a final set of keyphrases.
arXiv Detail & Related papers (2020-05-19T20:24:02Z)