Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing
- URL: http://arxiv.org/abs/2404.14740v3
- Date: Tue, 22 Jul 2025 04:58:13 GMT
- Title: Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing
- Authors: Ben Hutchinson,
- Abstract summary: Religious texts are expressions of culturally important values.<n>Machine learned models have a propensity to reproduce cultural values encoded in their training data.<n>This paper argues that NLP's use of such texts raises considerations that go beyond model biases.
- Score: 1.7794383050238662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP's use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities.
Related papers
- Detecting Religious Language in Climate Discourse [1.1707176242280342]
This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs)<n>We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting.<n>Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence.
arXiv Detail & Related papers (2025-10-27T14:54:51Z) - Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation [70.43884512651668]
We formalize Genette's (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for machine translation.<n>We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai.<n>Our findings demonstrate the potential of paratextual explicitation in advancing machine translation beyond linguistic equivalence.
arXiv Detail & Related papers (2025-09-27T16:27:36Z) - Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish? [0.0]
Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain.<n>We first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants.<n>We assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern.<n>This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use.
arXiv Detail & Related papers (2025-06-04T16:31:37Z) - SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation [55.61004653386632]
Large Language Models (LLMs) often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context.
This paper introduces a novel self-supervised method for generating a training set of unfaithful samples.
We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones.
arXiv Detail & Related papers (2025-02-19T12:31:58Z) - Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce [27.918975040084387]
Data in a given language should be viewed as more than a collection of tokens.
Good data collection and labeling practices are key to building more human-centered and socially aware technologies.
arXiv Detail & Related papers (2024-10-16T15:51:18Z) - The Nature of NLP: Analyzing Contributions in NLP Papers [77.31665252336157]
We quantitatively investigate what constitutes NLP research by examining research papers.
Our findings reveal a rising involvement of machine learning in NLP since the early nineties.
In post-2020, there has been a resurgence of focus on language and people.
arXiv Detail & Related papers (2024-09-29T01:29:28Z) - CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
"CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs)
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Natural Language Decompositions of Implicit Content Enable Better Text
Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z) - An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z) - EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background
Prediction in English [25.38572483508948]
We augment natural language processing models with cultural background features.
We show that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US.
Our findings support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.
arXiv Detail & Related papers (2022-03-28T04:57:17Z) - Towards Responsible Natural Language Annotation for the Varieties of
Arabic [12.526184907781731]
We present a playbook for responsible dataset creation for polyglossic, multidialectal languages.
This work is informed by a study on Arabic annotation of social media content.
arXiv Detail & Related papers (2022-03-17T20:23:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.