Lessons Learned from a Citizen Science Project for Natural Language
Processing
- URL: http://arxiv.org/abs/2304.12836v1
- Date: Tue, 25 Apr 2023 14:08:53 GMT
- Authors: Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, Gözde Gül Şahin,
Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart de Castilho,
Iryna Gurevych
- Abstract summary: Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP.
We conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset.
Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many Natural Language Processing (NLP) systems use annotated corpora for
training and evaluation. However, labeled data is often costly to obtain and
scaling annotation projects is difficult, which is why annotation tasks are
often outsourced to paid crowdworkers. Citizen Science is an alternative to
crowdsourcing that is relatively unexplored in the context of NLP. To
investigate whether and how well Citizen Science can be applied in this
setting, we conduct an exploratory study into engaging different groups of
volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing
crowdsourced dataset. Our results show that this can yield high-quality
annotations and attract motivated volunteers, but also requires considering
factors such as scalability, participation over time, and legal and ethical
issues. We summarize lessons learned in the form of guidelines and provide our
code and data to aid future work on Citizen Science.
Related papers
- Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce [27.918975040084387]
Data in a given language should be viewed as more than a collection of tokens.
Good data collection and labeling practices are key to building more human-centered and socially aware technologies.
arXiv Detail & Related papers (2024-10-16T15:51:18Z)
- The Nature of NLP: Analyzing Contributions in NLP Papers [77.31665252336157]
We quantitatively investigate what constitutes NLP research by examining research papers.
Our findings reveal a rising involvement of machine learning in NLP since the early nineties.
Post-2020, there has been a resurgence of focus on language and people.
arXiv Detail & Related papers (2024-09-29T01:29:28Z)
- What Can Natural Language Processing Do for Peer Review? [173.8912784451817]
In modern science, peer review is widely used, yet it is hard, time-consuming, and prone to error.
Since the artifacts involved in peer review are largely text-based, Natural Language Processing has great potential to improve reviewing.
We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance.
arXiv Detail & Related papers (2024-05-10T16:06:43Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Fairness Certification for Natural Language Processing and Large Language Models [0.0]
We follow a qualitative research approach towards a fairness certification for NLP approaches.
We have systematically devised six fairness criteria for NLP, which can be further refined into 18 sub-categories.
arXiv Detail & Related papers (2024-01-02T16:09:36Z)
- Situated Natural Language Explanations [54.083715161895036]
Natural language explanations (NLEs) are among the most accessible tools for explaining decisions to humans.
Existing NLE research perspectives do not take the audience into account.
Situated NLE provides a perspective and facilitates further research on the generation and evaluation of explanations.
arXiv Detail & Related papers (2023-08-27T14:14:28Z)
- A Survey of Knowledge Enhanced Pre-trained Language Models [78.56931125512295]
We present a comprehensive review of Knowledge Enhanced Pre-trained Language Models (KE-PLMs).
For NLU, we divide the types of knowledge into four categories: linguistic knowledge, text knowledge, knowledge graph (KG) knowledge, and rule knowledge.
The KE-PLMs for NLG are categorized into KG-based and retrieval-based methods.
arXiv Detail & Related papers (2022-11-11T04:29:02Z)
- Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages [6.8708103492634836]
Hundreds of underserved languages have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts.
We make the case that IGT data can be leveraged successfully provided that target language expertise is available.
We illustrate each step through a case study on developing a morphological reinflection system for the Tsimshianic language Gitksan.
arXiv Detail & Related papers (2022-03-17T22:02:25Z)
- Natural language processing for achieving sustainable development: the case of neural labelling to enhance community profiling [2.6734009991058794]
This research paper shows the high potential of NLP applications to enhance the sustainability of projects.
We focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists.
We propose the new task of Automatic UPV classification, which is an extreme multi-class multi-label classification problem.
arXiv Detail & Related papers (2020-04-27T16:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.