Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election
- URL: http://arxiv.org/abs/2412.13098v1
- Date: Tue, 17 Dec 2024 17:08:35 GMT
- Title: Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election
- Authors: Roberto Mondini, Neema Kotonya, Robert L. Logan IV, Elizabeth M Olson, Angela Oduor Lungati, Daniel Duke Odongo, Tim Ombasa, Hemank Lamba, Aoife Cahill, Joel R. Tetreault, Alejandro Jaimes
- Abstract summary: We present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election.
We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.
- Score: 49.35115948941981
- Abstract: Online reporting platforms have enabled citizens around the world to collectively share their opinions and report in real time on events impacting their local communities. Systematically organizing (e.g., categorizing by attributes) and geotagging large amounts of crowdsourced information is crucial to ensuring that accurate and meaningful insights can be drawn from this data and used by policy makers to bring about positive change. These tasks, however, typically require extensive manual annotation efforts. In this paper we present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election containing mentions of election-related issues such as official misconduct, vote count irregularities, and acts of violence. We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.
Related papers
- Analyzing the Impact of Fake News on the Anticipated Outcome of the 2024 Election Ahead of Time [7.1970442944315245]
Despite increasing awareness and research around fake news, there is still a significant need for datasets that specifically target racial slurs and biases within North American political speeches.
This study introduces a comprehensive dataset that illuminates these critical aspects of misinformation.
arXiv Detail & Related papers (2023-12-01T20:14:16Z)
- Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs [65.9077733300329]
Large Language Models (LLMs) have the potential to greatly enhance the analysis of public affairs documents.
LLMs can be of great use to process domain-specific documents, such as those in the domain of public affairs.
arXiv Detail & Related papers (2023-06-05T13:35:01Z)
- Lessons Learned from a Citizen Science Project for Natural Language Processing [53.48988266271858]
Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP.
We conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset.
Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues.
arXiv Detail & Related papers (2023-04-25T14:08:53Z)
- Design and analysis of tweet-based election models for the 2021 Mexican legislative election [55.41644538483948]
We use a dataset of 15 million election-related tweets in the six months preceding election day.
We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods.
arXiv Detail & Related papers (2023-01-02T12:40:05Z)
- Fast Few shot Self-attentive Semi-supervised Political Inclination Prediction [12.472629584751509]
It is increasingly common now for policymakers/journalists to create online polls on social media to understand the political leanings of people in specific locations.
We introduce a self-attentive semi-supervised framework for political inclination detection to further that objective.
We found that the model is highly efficient even in resource-constrained settings.
arXiv Detail & Related papers (2022-09-21T12:07:16Z)
- FacTeR-Check: Semi-automated fact-checking through Semantic Similarity and Natural Language Inference [61.068947982746224]
FacTeR-Check enables retrieving fact-checked information, unchecked claims verification and tracking dangerous information over social media.
The architecture is validated using a new dataset called NLI19-SP that is publicly released with COVID-19 related hoaxes and tweets from Spanish social media.
Our results show state-of-the-art performance on the individual benchmarks, as well as producing useful analysis of the evolution over time of 61 different hoaxes.
arXiv Detail & Related papers (2021-10-27T15:44:54Z)
- TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels [5.267993069044648]
This work presents TBCOV, a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year.
Several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities, mentions of persons, organizations, locations, user types, and gender information.
Our sentiment and trend analyses reveal interesting insights and confirm TBCOV's broad coverage of important topics.
arXiv Detail & Related papers (2021-10-04T06:17:12Z)
- Leveraging Administrative Data for Bias Audits: Assessing Disparate Coverage with Mobility Data for COVID-19 Policy [61.60099467888073]
We show how linking administrative data can enable auditing mobility data for bias.
We show that older and non-white voters are less likely to be captured by mobility data.
We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
arXiv Detail & Related papers (2020-11-14T02:04:14Z)
- CovidNet: To Bring Data Transparency in the Era of COVID-19 [9.808021836153712]
This paper presents CovidNet, a COVID-19 tracking project associated with a large scale epidemic dataset.
CovidNet is the only platform providing real-time global case information of more than 4,124 sub-divisions from over 27 countries worldwide.
The accuracy and freshness of the dataset is a result of the painstaking efforts from our voluntary teamwork, crowd-sourcing channels, and automated data pipelines.
arXiv Detail & Related papers (2020-05-22T00:05:17Z)