Censorship of Online Encyclopedias: Implications for NLP Models
- URL: http://arxiv.org/abs/2101.09294v1
- Date: Fri, 22 Jan 2021 19:09:53 GMT
- Title: Censorship of Online Encyclopedias: Implications for NLP Models
- Authors: Eddie Yang, Margaret E. Roberts
- Abstract summary: We show how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.
We show that word embeddings trained on Baidu Baike, an online Chinese encyclopedia, have very different associations between adjectives and a range of concepts.
Our paper shows how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While artificial intelligence provides the backbone for many tools people use
around the world, recent work has brought to attention that the algorithms
powering AI are not free of politics, stereotypes, and bias. While most work in
this area has focused on the ways in which AI can exacerbate existing
inequalities and discrimination, very little work has studied how governments
actively shape training data. We describe how censorship has affected the
development of Wikipedia corpuses, text data which are regularly used for
pre-trained inputs into NLP algorithms. We show that word embeddings trained on
Baidu Baike, an online Chinese encyclopedia, have very different associations
between adjectives and a range of concepts about democracy, freedom, collective
action, equality, and people and historical events in China than its regularly
blocked but uncensored counterpart - Chinese language Wikipedia. We examine the
implications of these discrepancies by studying their use in downstream AI
applications. Our paper shows how government repression, censorship, and
self-censorship may impact training data and the applications that draw from
them.
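To make the kind of comparison the abstract describes concrete, below is a minimal, hypothetical sketch of measuring adjective-concept associations in two separately trained embedding spaces via cosine similarity. The toy vectors and word lists are illustrative placeholders, not the authors' data or code; in practice one would load embeddings trained on Baidu Baike and on Chinese-language Wikipedia and use curated adjective lists.

```python
# Minimal sketch (not the authors' code): comparing how strongly a concept word
# associates with negative vs. positive adjectives in two embedding spaces.
# The tiny hand-made vectors below are placeholders; in practice you would load
# real embeddings trained on Baidu Baike and on Chinese-language Wikipedia.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(embeddings, concept, adjectives):
    """Mean cosine similarity between a concept word and a set of adjectives."""
    c = embeddings[concept]
    return np.mean([cosine(c, embeddings[a]) for a in adjectives])

# Hypothetical toy vectors standing in for two separately trained embedding sets.
baike_vecs = {"democracy": np.array([0.1, 0.9, 0.2]),
              "chaotic":   np.array([0.2, 0.8, 0.1]),
              "stable":    np.array([0.9, 0.1, 0.3])}
wiki_vecs  = {"democracy": np.array([0.8, 0.2, 0.3]),
              "chaotic":   np.array([0.1, 0.9, 0.2]),
              "stable":    np.array([0.9, 0.2, 0.2])}

negative_adjs = ["chaotic"]
positive_adjs = ["stable"]

for name, vecs in [("Baidu Baike", baike_vecs), ("Chinese Wikipedia", wiki_vecs)]:
    neg = association(vecs, "democracy", negative_adjs)
    pos = association(vecs, "democracy", positive_adjs)
    # A positive gap means the concept sits closer to the negative adjectives.
    print(f"{name}: negative-minus-positive association gap = {neg - pos:+.3f}")
```

A gap that differs in sign or size between the two embedding sources would indicate the kind of divergent adjective-concept associations the abstract reports.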
Related papers
- Comparing diversity, negativity, and stereotypes in Chinese-language AI technologies: a case study on Baidu, Ernie and Qwen [1.3354439722832292]
We study Chinese-language tools by investigating social biases embedded in the major Chinese search engine, Baidu, and in the language models Ernie and Qwen.
We collect over 30k views encoded in these tools by prompting them for candidate words describing various social groups.
We find that language models exhibit a larger variety of embedded views compared to the search engine, although Baidu and Qwen generate negative content more often than Ernie.
arXiv Detail & Related papers (2024-08-28T10:51:18Z) - Algorithmically Curated Lies: How Search Engines Handle Misinformation
about US Biolabs in Ukraine [39.58317527488534]
We conduct virtual agent-based algorithm audits of Google, Bing, and Yandex search outputs in June 2022.
We find significant disparities in misinformation exposure based on the language of search, with all search engines presenting a higher number of false stories in Russian.
These observations stress the possibility that such algorithmic curation systems are vulnerable to manipulation, particularly during unfolding propaganda campaigns.
arXiv Detail & Related papers (2024-01-24T22:15:38Z) - National Origin Discrimination in Deep-learning-powered Automated Resume
Screening [3.251347385432286]
Many companies and organizations have started to use some form of AI-enabled automated tools to assist in their hiring process.
There are increasing concerns about unfair treatment of candidates caused by underlying bias in AI systems.
This study examined deep learning methods, a recent technology breakthrough, with a focus on their application to automated resume screening.
arXiv Detail & Related papers (2023-07-13T01:35:29Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - Harnessing the Power of Text-image Contrastive Models for Automatic
Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows superior performance in detecting non-matched image-text pairs when training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z) - Faking Fake News for Real Fake News Detection: Propaganda-loaded
Training Data Generation [105.20743048379387]
We propose a novel framework for generating training examples informed by the known styles and strategies of human-authored propaganda.
Specifically, we perform self-critical sequence training guided by natural language inference to ensure the validity of the generated articles.
Our experimental results show that fake news detectors trained on the resulting PropaNews dataset are better at detecting human-written disinformation by 3.62 - 7.69% F1 score on two public datasets.
arXiv Detail & Related papers (2022-03-10T14:24:19Z) - Dataset of Propaganda Techniques of the State-Sponsored Information
Operation of the People's Republic of China [0.0]
This research aims to bridge the information gap by providing a multi-labeled propaganda techniques dataset in Mandarin based on a state-backed information operation dataset provided by Twitter.
In addition to presenting the dataset, we apply multi-label text classification using fine-tuned BERT (a minimal illustrative sketch appears after this list).
arXiv Detail & Related papers (2021-06-14T16:11:13Z) - Cross-Domain Learning for Classifying Propaganda in Online Contents [67.10699378370752]
We present an approach to leverage cross-domain learning, based on labeled documents and sentences from news and tweets, as well as political speeches with a clear difference in their degrees of being propagandistic.
Our experiments demonstrate the usefulness of this approach, and identify difficulties and limitations in various configurations of sources and targets for the transfer step.
arXiv Detail & Related papers (2020-11-13T10:19:13Z) - FairCVtest Demo: Understanding Bias in Multimodal Learning with a
Testbed in Fair Automatic Recruitment [79.23531577235887]
This demo shows the capacity of the Artificial Intelligence (AI) behind a recruitment tool to extract sensitive information from unstructured data.
Additionally, the demo includes a new algorithm for discrimination-aware learning, which eliminates sensitive information in our multimodal AI framework.
arXiv Detail & Related papers (2020-09-12T17:45:09Z) - Bias in Multimodal AI: Testbed for Fair Automatic Recruitment [73.85525896663371]
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases.
Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.
arXiv Detail & Related papers (2020-04-15T15:58:05Z) - Explaining the Relationship between Internet and Democracy in Partly
Free Countries Using Machine Learning Models [0.0]
This study sheds new light on the effects of the internet on democratization in partly free countries.
Internet penetration and online censorship both have a negative impact on democracy scores.
Online censorship is the most important variable affecting democracy scores, followed by governance index and education.
arXiv Detail & Related papers (2020-04-11T02:26:37Z)