Chinese Offensive Language Detection: Current Status and Future Directions
- URL: http://arxiv.org/abs/2403.18314v3
- Date: Fri, 29 Mar 2024 18:48:35 GMT
- Title: Chinese Offensive Language Detection: Current Status and Future Directions
- Authors: Yunze Xiao, Houda Bouamor, Wajdi Zaghouani
- Abstract summary: This paper provides a comprehensive overview of offensive language detection in Chinese, examining current benchmarks and approaches.
The primary objective of this survey is to explore the existing techniques and identify potential avenues for further research.
- Score: 2.1357786131968637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the considerable efforts being made to monitor and regulate user-generated content on social media platforms, the pervasiveness of offensive language, such as hate speech or cyberbullying, in the digital space remains a significant challenge. Given the importance of maintaining a civilized and respectful online environment, there is an urgent and growing need for automatic systems capable of detecting offensive speech in real time. However, developing effective systems for processing languages such as Chinese presents a significant challenge, owing to the language's complex and nuanced nature, which makes it difficult to process automatically. This paper provides a comprehensive overview of offensive language detection in Chinese, examining current benchmarks and approaches and highlighting specific models and tools for addressing the unique challenges of detecting offensive language in this complex language. The primary objective of this survey is to explore the existing techniques and identify potential avenues for further research that can address the cultural and linguistic complexities of Chinese.
Related papers
- A comprehensive cross-language framework for harmful content detection with the aid of sentiment analysis [0.356008609689971]
This study introduces, for the first time, a detailed framework adaptable to any language.
A key component of the framework is the development of a general and detailed annotation guideline.
The integration of sentiment analysis represents a novel approach to enhancing harmful language detection.
arXiv Detail & Related papers (2024-03-02T17:13:47Z)
- Language Detection for Transliterated Content [0.0]
We study the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages.
This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English.
The research pioneers innovative approaches to identify and convert transliterated text.
arXiv Detail & Related papers (2024-01-09T15:40:54Z)
- Detection of Offensive and Threatening Online Content in a Low Resource Language [0.0]
Hausa is a major Chadic language, spoken by over 100 million people in Africa.
Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language.
arXiv Detail & Related papers (2023-11-17T14:08:44Z)
- Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses.
Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs.
To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z)
- Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Societal Biases in Language Generation: Progress and Challenges [43.06301135908934]
Language generation presents unique challenges in terms of direct user interaction and the structure of decoding techniques.
We present a survey on societal biases in language generation, focusing on how techniques contribute to biases and on progress towards bias analysis and mitigation.
Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques.
arXiv Detail & Related papers (2021-05-10T00:17:33Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art ToD models based on large pretrained neural language models are data hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.