Cross-Cultural Transfer Learning for Chinese Offensive Language
Detection
- URL: http://arxiv.org/abs/2303.17927v1
- Date: Fri, 31 Mar 2023 09:50:07 GMT
- Title: Cross-Cultural Transfer Learning for Chinese Offensive Language
Detection
- Authors: Li Zhou, Laura Cabello, Yong Cao, Daniel Hershcovich
- Abstract summary: We aim to investigate the impact of transfer learning using offensive language detection data from different cultural backgrounds.
We find that culture-specific biases in what is considered offensive negatively impact the transferability of language models.
In a few-shot learning scenario, however, our study shows promising prospects for non-English offensive language detection with limited resources.
- Score: 9.341003339029221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting offensive language is a challenging task. Generalizing across
different cultures and languages becomes even more challenging: besides
lexical, syntactic and semantic differences, pragmatic aspects such as cultural
norms and sensitivities, which are particularly relevant in this context, vary
greatly. In this paper, we target Chinese offensive language detection and aim
to investigate the impact of transfer learning using offensive language
detection data from different cultural backgrounds, specifically Korean and
English. We find that culture-specific biases in what is considered offensive
negatively impact the transferability of language models (LMs) and that LMs
trained on diverse cultural data are sensitive to different features in Chinese
offensive language detection. In a few-shot learning scenario, however, our
study shows promising prospects for non-English offensive language detection
with limited resources. Our findings highlight the importance of cross-cultural
transfer learning in improving offensive language detection and promoting
inclusive digital spaces.
Related papers
- The Echoes of Multilinguality: Tracing Cultural Value Shifts during LM Fine-tuning [23.418656688405605]
We study how languages can exert influence on the cultural values encoded for different test languages, by studying how such values are revised during fine-tuning.
Lastly, we use a training data attribution method to find patterns in the fine-tuning examples, and the languages that they come from, that tend to instigate value shifts.
arXiv Detail & Related papers (2024-05-21T12:55:15Z) - CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features [19.72091739119933]
Our study delves into the intersection of cultural features and transfer learning effectiveness.
Based on these results, we advocate for the integration of cultural information into datasets.
Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
arXiv Detail & Related papers (2023-10-10T09:29:38Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - Relationship of the language distance to English ability of a country [0.0]
We introduce a novel solution to measure the semantic dissimilarity between languages.
We empirically examine the effectiveness of the proposed semantic language distance.
The experimental results show that the language distance demonstrates negative influence on a country's average English ability.
arXiv Detail & Related papers (2022-11-15T02:40:00Z) - Cross-Lingual Ability of Multilingual Masked Language Models: A Study of
Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while the composition is more crucial to the success of cross-linguistic transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z) - COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose textscCOLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z) - Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.