Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative
Dataset to Fight Online Hate Speech
- URL: http://arxiv.org/abs/2107.08720v1
- Date: Mon, 19 Jul 2021 09:45:54 GMT
- Title: Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative
Dataset to Fight Online Hate Speech
- Authors: Margherita Fanton, Helena Bonaldi, Serra Sinem Tekiroglu, Marco
Guerini
- Abstract summary: Undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible solution for having healthier online communities.
We propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively.
Results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection.
- Score: 10.323063834827416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Undermining the impact of hateful content with informed and non-aggressive
responses, called counter narratives, has emerged as a possible solution for
having healthier online communities. Thus, some NLP studies have started
addressing the task of counter narrative generation. Although such studies have
made an effort to build hate speech / counter narrative (HS/CN) datasets for
neural generation, they fall short in reaching either high-quality and/or
high-quantity. In this paper, we propose a novel human-in-the-loop data
collection methodology in which a generative language model is refined
iteratively by using its own data from the previous loops to generate new
training samples that experts review and/or post-edit. Our experiments
comprised several loops including dynamic variations. Results show that the
methodology is scalable and facilitates diverse, novel, and cost-effective data
collection. To our knowledge, the resulting dataset is the only expert-based
multi-target HS/CN dataset available to the community.
Related papers
- CorrSynth -- A Correlated Sampling Method for Diverse Dataset Generation from LLMs [5.89889361990138]
Large language models (LLMs) have demonstrated remarkable performance in diverse tasks using zero-shot and few-shot prompting.
In this work, we tackle the challenge of generating datasets with high diversity, upon which a student model is trained for downstream tasks.
Taking the route of decoding-time guidance-based approaches, we propose Corr Synth, which generates data that is more diverse and faithful to the input prompt using a correlated sampling strategy.
arXiv Detail & Related papers (2024-11-13T12:09:23Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation [91.16551253297588]
COunterfactual Generation via Retrieval and Editing (CORE) is a retrieval-augmented generation framework for creating diverse counterfactual perturbations for training.
CORE first performs a dense retrieval over a task-related unlabeled text corpus using a learned bi-encoder.
CORE then incorporates these into prompts to a large language model with few-shot learning capabilities, for counterfactual editing.
arXiv Detail & Related papers (2022-10-10T17:45:38Z) - Unsupervised Neural Stylistic Text Generation using Transfer learning
and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only $0.3%$ of model parameters to learn style specific attributes for response generation.
We learn style specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z) - Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case
Study of Political Public Figures [7.52579126252489]
We propose a new Multi-task Learning pipeline that utilizes MTL to train simultaneously across multiple hate speech datasets.
We show strong results when examining generalization error in train-test splits and substantial improvements when predicting on previously unseen datasets.
We also assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American Public Political Figures.
arXiv Detail & Related papers (2022-08-22T21:13:38Z) - Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z) - Does Putting a Linguist in the Loop Improve NLU Data Collection? [34.34874979524489]
Crowdsourcing NLP datasets contain systematic gaps and biases that are identified only after data collection is complete.
We take natural language inference as a test case and ask whether it is beneficial to put a linguist in the loop' during data collection.
arXiv Detail & Related papers (2021-04-15T00:31:10Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - Improving Multi-Turn Response Selection Models with Complementary
Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.