Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed
Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language
Models
- URL: http://arxiv.org/abs/2403.02121v1
- Date: Mon, 4 Mar 2024 15:27:49 GMT
- Title: Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed
Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language
Models
- Authors: Sargam Yadav (1), Abhishek Kaushik (1) and Kevin McDaid (1) ((1)
Dundalk Institute of Technology, Dundalk)
- Abstract summary: Hate speech detection in code-mixed low-resource languages is an active problem area where the use of Large Language Models has proven beneficial.
In this study, we have compiled a dataset of 100 YouTube comments and weakly labelled them for coarse and fine-grained misogyny classification in code-mixed Hinglish.
Out of all the approaches, zero-shot classification using the Bidirectional Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer-3 (ChatGPT-3) achieve the best results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) has advanced the benchmark in
various Natural Language Processing (NLP) tasks. However, large amounts of
labelled training data are required to train LLMs. Furthermore, data annotation
and training are computationally expensive and time-consuming. Zero- and
few-shot learning have recently emerged as viable options for labelling data
using large pre-trained models. Hate speech detection in code-mixed,
low-resource languages is an active problem area where the use of LLMs has
proven beneficial. In this study, we have compiled a dataset of 100 YouTube
comments and weakly labelled them for coarse and fine-grained misogyny
classification in code-mixed Hinglish. Weak annotation was applied because the
annotation process is labor-intensive. Zero-shot, one-shot, and few-shot
learning and prompting approaches were then applied to assign labels to the
comments, and these labels were compared to the human-assigned labels. Out of
all the approaches, zero-shot classification using the Bidirectional
Auto-Regressive Transformers (BART) large model and few-shot prompting using
Generative Pre-trained Transformer-3 (ChatGPT-3) achieve the best results.
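As a rough illustration of the two best-performing setups described above, the sketch below pairs Hugging Face's zero-shot-classification pipeline (backed by a BART-large model fine-tuned on MNLI) with a few-shot prompt of the kind that could be sent to a GPT-3-class model. The candidate labels, example comment, and prompt wording are illustrative assumptions, not the paper's exact configuration.
```python
# Minimal sketch (not the authors' code) of the two best-performing setups:
# zero-shot NLI-based classification with BART-large-MNLI, and a few-shot prompt
# for a GPT-3-class model. Labels, example comment, and prompt wording are
# illustrative assumptions.
from transformers import pipeline

comment = "Example code-mixed Hinglish YouTube comment goes here"

# Zero-shot classification via the Hugging Face pipeline, backed by a BART-large
# model fine-tuned on MNLI.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
coarse = classifier(comment, candidate_labels=["misogynistic", "not misogynistic"])
print(coarse["labels"][0], coarse["scores"][0])   # top label and its score

# Few-shot prompt of the kind that could be sent to a GPT-3-class model; the
# in-context examples are placeholders, not items from the dataset.
few_shot_prompt = (
    "Classify each Hinglish comment as 'misogynistic' or 'not misogynistic'.\n\n"
    "Comment: <labelled example 1>\nLabel: misogynistic\n\n"
    "Comment: <labelled example 2>\nLabel: not misogynistic\n\n"
    f"Comment: {comment}\nLabel:"
)
# `few_shot_prompt` would then be sent to the model's completion or chat endpoint.
```
Fine-grained misogyny classification would follow the same pattern, with a longer candidate-label list and more in-context examples.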
Related papers
- Active Learning for NLP with Large Language Models [4.1967870107078395]
Active Learning (AL) techniques can be used to label as few samples as possible while still reaching reasonable or comparable results.
This work investigates the accuracy and cost of using Large Language Models (LLMs) to label samples on 3 different datasets.
arXiv Detail & Related papers (2024-01-14T21:00:52Z)
- LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
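The two entries above both put an LLM in the annotator's seat of an active-learning loop. The sketch below shows a minimal version of that loop, with a TF-IDF and logistic-regression student doing uncertainty sampling and a placeholder `llm_annotate` function standing in for the actual prompt-based labelling call; the function name and toy data are assumptions, not either paper's implementation.
```python
# Rough sketch of an LLM-in-the-loop active-learning cycle (not either paper's code).
# `llm_annotate` is a hypothetical placeholder for a prompt-based labelling call.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def llm_annotate(texts):
    """Placeholder: in practice, prompt an LLM and parse its label for each text."""
    return [0 for _ in texts]  # dummy labels

labelled_texts, labels = ["seed example a", "seed example b"], [0, 1]
unlabelled = ["pool text 1", "pool text 2", "pool text 3", "pool text 4"]

vec = TfidfVectorizer()
for _ in range(3):  # a few active-learning rounds
    X = vec.fit_transform(labelled_texts)
    student = LogisticRegression().fit(X, labels)
    if not unlabelled:
        break
    # Uncertainty sampling: pick the pool item the student is least sure about.
    probs = student.predict_proba(vec.transform(unlabelled))
    margins = np.abs(probs[:, 1] - 0.5)
    pick = int(np.argmin(margins))
    text = unlabelled.pop(pick)
    labelled_texts.append(text)
    labels.extend(llm_annotate([text]))  # the LLM plays the annotator
```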
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
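The boosting idea above can be sketched as an AdaBoost-style reweighting loop. In the sketch below, fixed prediction vectors stand in for the outputs of prompt-based LLM weak learners, so it shows only the ensemble mechanics, not how the weak learners themselves would be obtained from an LLM.
```python
# Sketch of the boosting mechanics only (not the paper's method). Each row of
# `preds` stands in for the hard labels a prompt-based LLM weak learner would
# return on a toy pool of four comments.
import numpy as np

y = np.array([1, 0, 1, 0])           # toy gold labels
preds = np.array([                    # one row per stand-in weak learner
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

n = len(y)
w = np.ones(n) / n                    # per-sample weights
ensemble = []                         # (alpha, learner index) pairs

for _ in range(3):                    # boosting rounds
    errs = np.array([w[p != y].sum() for p in preds])
    best = int(np.argmin(errs))
    err = float(np.clip(errs[best], 1e-10, 1 - 1e-10))
    alpha = 0.5 * np.log((1 - err) / err)       # learner weight
    miss = preds[best] != y
    w *= np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes
    w /= w.sum()
    ensemble.append((alpha, best))

# Weighted vote over the selected weak learners (score > 0 means class 1).
scores = sum(a * (2 * preds[i] - 1) for a, i in ensemble)
print((scores > 0).astype(int))       # agrees with y on this toy example
```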
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge through chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
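A chain-of-dictionary style prompt can be assembled by prepending chained dictionary hints for selected source words to the translation request, roughly as sketched below; the wording, languages, and dictionary entries are illustrative assumptions rather than the paper's exact prompt format.
```python
# Illustrative sketch of a chain-of-dictionary style prompt (wording and
# dictionary entries are assumptions, not the paper's exact format).
def chain_of_dictionary_prompt(source_sentence, chains, src_lang="English", tgt_lang="Hindi"):
    """Prepend chained multilingual dictionary hints for selected source words."""
    hint_lines = []
    for word, chain in chains.items():
        # e.g. "apple" means "seb" in Hindi means "pomme" in French.
        links = " means ".join(f'"{w}" in {lang}' for lang, w in chain)
        hint_lines.append(f'"{word}" means {links}.')
    hints = "\n".join(hint_lines)
    return (f"{hints}\n\nTranslate the following {src_lang} sentence into "
            f"{tgt_lang}:\n{source_sentence}")

chains = {"apple": [("Hindi", "seb"), ("French", "pomme")]}
print(chain_of_dictionary_prompt("She ate an apple.", chains))
```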
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
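The explain-then-annotate recipe can be sketched as two prompt templates: first ask the model to explain why a demonstration example carries its gold label, then include that explanation alongside the demonstration when annotating new items. The templates and the `ask_llm` placeholder below are assumptions, not AnnoLLM's exact prompts.
```python
# Sketch of a two-step explain-then-annotate prompt flow (templates and the
# `ask_llm` placeholder are assumptions).
def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a GPT-3.5-class model."""
    return "<model response>"

demo_text, demo_label = "Example comment with a gold label", "misogynistic"

# Step 1: ask the model to explain why the demonstration has its gold label.
explanation = ask_llm(
    f"Comment: {demo_text}\nLabel: {demo_label}\n"
    "Briefly explain why this label is correct."
)

# Step 2: reuse the demonstration plus its generated explanation when annotating.
new_text = "A new, unlabelled comment"
annotation = ask_llm(
    f"Comment: {demo_text}\nLabel: {demo_label}\nReason: {explanation}\n\n"
    f"Comment: {new_text}\nLabel:"
)
print(annotation)
```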
- Distinguishability Calibration to In-Context Learning [31.375797763897104]
We propose a method to map a PLM-encoded embedding into a new metric space to guarantee the distinguishability of the resulting embeddings.
We also take advantage of hyperbolic embeddings to capture the hierarchical relations among fine-grained class-associated token embeddings.
arXiv Detail & Related papers (2023-02-13T09:15:00Z)
- PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models [67.3725459417758]
PERFECT is a simple and efficient method for few-shot fine-tuning of PLMs without relying on handcrafted prompts.
We show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning.
Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods.
arXiv Detail & Related papers (2022-04-03T22:31:25Z)
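The adapter side of this idea, task-specific bottleneck modules fine-tuned in place of handcrafted prompts, can be sketched as a small PyTorch module; the hidden size, bottleneck width, and placement below are assumptions, and only the adapter component of the approach is shown.
```python
# Minimal bottleneck adapter sketch in PyTorch (dimensions and placement are
# assumptions; only the adapter component is shown).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, residual connection."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During few-shot fine-tuning, only the adapter parameters would be updated
# while the pre-trained model stays frozen.
x = torch.randn(2, 16, 768)     # (batch, sequence, hidden)
print(Adapter()(x).shape)       # torch.Size([2, 16, 768])
```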
- ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling [57.80052276304937]
This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task.
We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.
arXiv Detail & Related papers (2022-01-04T20:08:17Z)
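One way to read the cluster-then-classify idea above: embed the unlabelled documents, cluster them, and hand only one representative per cluster to a zero-shot classifier, propagating its label to the whole cluster. In the loose sketch below, TF-IDF and KMeans stand in for ZeroBERTo's topic-modelling and embedding choices; this is not the paper's exact pipeline.
```python
# Loose sketch of clustering before zero-shot classification (TF-IDF and KMeans
# are stand-ins, not ZeroBERTo's exact pipeline).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the striker scored twice in the final",
    "the team won the championship match",
    "parliament passed the new budget bill",
    "the minister announced tax reforms",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One representative document per cluster: the one closest to the centroid.
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
    rep = docs[members[np.argmin(dists)]]
    # A zero-shot classifier (e.g. the BART-MNLI pipeline shown earlier) would
    # now label `rep`, and its label would be propagated to the whole cluster.
    print(c, rep)
```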
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
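The self-training loop at the core of this entry (and of the dialectal-Arabic entry that follows) reduces to: train on the labelled seed, pseudo-label the unlabelled pool with confident predictions, and retrain. A minimal sketch is given below, with a TF-IDF and logistic-regression student standing in for the prompt-based language model; the threshold and toy data are assumptions.
```python
# Minimal self-training / pseudo-labelling sketch (a TF-IDF + logistic-regression
# student stands in for the prompt-based language model used in the papers).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["toy labelled comment one", "toy labelled comment two"]
labels = [1, 0]
unlabelled = ["toy pool comment a", "toy pool comment b", "toy pool comment c"]

vec = TfidfVectorizer()
for _ in range(2):                                   # self-training rounds
    model = LogisticRegression().fit(vec.fit_transform(labelled), labels)
    if not unlabelled:
        break
    probs = model.predict_proba(vec.transform(unlabelled))
    confident = np.max(probs, axis=1) >= 0.6         # confidence threshold (tune)
    for i in sorted(np.where(confident)[0], reverse=True):
        labelled.append(unlabelled.pop(i))
        labels.append(int(np.argmax(probs[i])))      # keep the pseudo-label
```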
- Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling [7.310390479801139]
We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce dialect varieties.
Our work opens up opportunities for developing Dialectal Arabic (DA) models exploiting only Modern Standard Arabic (MSA) resources.
arXiv Detail & Related papers (2021-01-12T21:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.