How Contentious Terms About People and Cultures are Used in Linked Open
Data
- URL: http://arxiv.org/abs/2311.10757v1
- Date: Mon, 13 Nov 2023 18:25:20 GMT
- Title: How Contentious Terms About People and Cultures are Used in Linked Open
Data
- Authors: Andrei Nesterov (1), Laura Hollink (1), Jacco van Ossenbruggen (2)
((1) Centrum Wiskunde & Informatica, (2) VU University Amsterdam)
- Abstract summary: When outdated and culturally stereotyping terminology is used in literals, such terms may appear offensive to users in interfaces and propagate stereotypes to algorithms trained on them.
We study how frequently and in which literals contentious terms about people and cultures occur in linked open data (LOD).
We inspect occurrences of these terms in four widely used datasets: Wikidata, The Getty Art & Architecture Thesaurus, Princeton WordNet, and Open Dutch WordNet.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Web resources in linked open data (LOD) are comprehensible to humans through
literal textual values attached to them, such as labels, notes, or comments.
Word choices in literals may not always be neutral. When outdated and culturally stereotyping terminology is used in literals, such terms may appear offensive to users in interfaces and propagate stereotypes to algorithms trained on them. We study how frequently and in which literals contentious
terms about people and cultures occur in LOD and whether there are attempts to
mark the usage of such terms. For our analysis, we reuse English and Dutch
terms from a knowledge graph that provides opinions of experts from the
cultural heritage domain about terms' contentiousness. We inspect occurrences
of these terms in four widely used datasets: Wikidata, The Getty Art &
Architecture Thesaurus, Princeton WordNet, and Open Dutch WordNet. Some terms
are ambiguous and contentious only in particular senses. Applying word sense
disambiguation, we generate a set of literals relevant to our analysis. We
found that outdated, derogatory, stereotyping terms frequently appear in
descriptive and labelling literals, such as preferred labels that are usually
displayed in interfaces and used for indexing. In some cases, LOD contributors
mark contentious terms with words and phrases in literals (implicit markers) or
properties linked to resources (explicit markers). However, such marking is
rare and inconsistent across all datasets. Our quantitative and qualitative
insights could be helpful in developing more systematic approaches to address
the propagation of stereotypes via LOD.
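To make the measurement setup concrete, the following is a minimal sketch, not the authors' pipeline: it looks up a candidate term in Wikidata labels and descriptions through the public SPARQL endpoint and then applies a simple Lesk word-sense check, since some terms are contentious only in particular senses. The example term "primitive", the SPARQLWrapper/NLTK dependencies, and all query details are illustrative assumptions; the paper itself reuses expert-curated English and Dutch terms from a cultural-heritage knowledge graph.

```python
# Minimal sketch (assumptions, not the authors' method):
# 1) find Wikidata items whose English label contains a candidate term,
# 2) run a rough Lesk sense check on the surrounding literal text.
# Requires: SPARQLWrapper, nltk (with the 'punkt' and 'wordnet' data packages).
from SPARQLWrapper import SPARQLWrapper, JSON
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

WDQS = "https://query.wikidata.org/sparql"

def literals_containing(term: str, lang: str = "en", limit: int = 20):
    """Yield (item URI, label, description) where the label contains `term`."""
    sparql = SPARQLWrapper(WDQS, agent="contentious-terms-sketch/0.1")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT ?item ?label ?description WHERE {{
          ?item rdfs:label ?label .
          FILTER(LANG(?label) = "{lang}" &&
                 CONTAINS(LCASE(STR(?label)), "{term.lower()}"))
          OPTIONAL {{
            ?item schema:description ?description .
            FILTER(LANG(?description) = "{lang}")
          }}
        }}
        LIMIT {limit}
    """)
    for row in sparql.query().convert()["results"]["bindings"]:
        yield (row["item"]["value"],
               row["label"]["value"],
               row.get("description", {}).get("value", ""))

def probable_sense(term: str, literal: str):
    """Very rough sense check: Lesk over the literal's tokens (may return None)."""
    return lesk(word_tokenize(literal.lower()), term)

if __name__ == "__main__":
    term = "primitive"  # hypothetical example; not taken from the paper's term list
    for uri, label, desc in literals_containing(term, limit=10):
        sense = probable_sense(term, f"{label} {desc}")
        print(uri, "|", label, "|", sense.name() if sense else "sense unresolved")
```

An unrestricted CONTAINS scan like this can time out on the live endpoint, so an analysis at the scale reported in the paper would more plausibly run over dumps of Wikidata, the Getty AAT, and both WordNets rather than over a public query service.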
Related papers
- Situated Ground Truths: Enhancing Bias-Aware AI by Situating Data Labels with SituAnnotate [0.1843404256219181]
SituAnnotate is a novel ontology-based approach to structured and context-aware data annotation.
It aims to anchor the ground-truth data employed in training AI systems within contextual and culturally bound situations.
As a method to create, query, and compare label-based datasets, SituAnnotate empowers downstream AI systems to undergo training with explicit consideration of context and cultural bias.
arXiv Detail & Related papers (2024-06-10T09:33:13Z)
- Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis [3.515619810213763]
We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations.
We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable.
arXiv Detail & Related papers (2023-05-19T20:36:21Z)
- ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution [28.878540389202367]
We develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial.
We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets.
Surprisingly, we find that reasonable quality annotations were already achievable (>90% agreement between the crowd and expert annotations) even without extensive training.
arXiv Detail & Related papers (2022-10-13T17:09:59Z)
- Latent Topology Induction for Understanding Contextualized Representations [84.7918739062235]
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
- The Curious Layperson: Fine-Grained Image Recognition without Expert Labels [90.88501867321573]
We consider a new problem: fine-grained image recognition without expert annotations.
We learn a model to describe the visual appearance of objects using non-expert image descriptions.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
arXiv Detail & Related papers (2021-11-05T17:58:37Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Do Context-Aware Translation Models Pay the Right Attention? [61.25804242929533]
Context-aware machine translation models are designed to leverage contextual information, but often fail to do so.
In this paper, we ask several questions: What contexts do human translators use to resolve ambiguous words?
We introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14K translations.
Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words.
arXiv Detail & Related papers (2021-05-14T17:32:24Z)
- Measuring and Increasing Context Usage in Context-Aware Machine Translation [64.5726087590283]
We introduce a new metric, conditional cross-mutual information, to quantify the usage of context by machine translation models.
We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models.
arXiv Detail & Related papers (2021-05-07T19:55:35Z)
- Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents [17.672677325827454]
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations.
We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines.
The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.
arXiv Detail & Related papers (2020-10-30T16:39:49Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)