Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
- URL: http://arxiv.org/abs/2509.17665v1
- Date: Mon, 22 Sep 2025 12:09:21 GMT
- Title: Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
- Authors: Katharina Simbeck, Mariam Mahran
- Abstract summary: This paper explores how religion is internally represented in large language models (LLMs). We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.
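As a rough illustration of the overlap measurement described in the abstract, the sketch below aggregates top-activating SAE feature indices over a religion prompt set and a violence prompt set, then compares the two sets with a Jaccard index. The helper names, the prompt lists, and the choice of Jaccard as the overlap metric are assumptions for illustration only; the placeholder feature lookup hashes tokens so the script runs end to end, whereas a real analysis would query SAE activations (for example, via the Neuronpedia API) instead.

```python
# Minimal sketch of a religion/violence SAE feature-overlap measurement.
# Assumptions (not from the paper): top-k feature sets per prompt, the
# Jaccard index as the overlap metric, and a placeholder activation
# lookup in place of a real Neuronpedia API call.

from typing import Iterable, Set

def fetch_top_features(prompt: str, k: int = 20) -> Set[int]:
    # Placeholder: derive pseudo feature indices from token hashes so the
    # sketch runs end to end. Replace with a real SAE activation lookup
    # (e.g., the Neuronpedia API) for an actual analysis.
    return {hash(tok) % 16384 for tok in prompt.lower().split()[:k]}

def aggregate_features(prompts: Iterable[str], k: int = 20) -> Set[int]:
    # Union of top-k feature indices across all prompts in a category.
    feats: Set[int] = set()
    for p in prompts:
        feats |= fetch_top_features(p, k)
    return feats

def jaccard_overlap(a: Set[int], b: Set[int]) -> float:
    # Jaccard index: |A intersect B| / |A union B|.
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical prompt lists; the paper's actual prompts are not shown here.
religion_prompts = ["People gather at the mosque for Friday prayers."]
violence_prompts = ["The mob attacked the protesters with weapons."]

score = jaccard_overlap(
    aggregate_features(religion_prompts),
    aggregate_features(violence_prompts),
)
print(f"religion-violence feature overlap: {score:.3f}")
```

Aggregating by set union keeps the comparison insensitive to how often a feature fires; a frequency-weighted variant (e.g., cosine similarity over feature activation counts) would be a natural alternative.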
Related papers
- Accumulating Context Changes the Beliefs of Language Models [44.87674077524695]
Language model assistants are increasingly used in applications such as brainstorming and research. This paper explores how accumulating context by engaging in interactions and processing text can change the beliefs of language models. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems.
arXiv Detail & Related papers (2025-11-03T18:05:57Z)
- Detecting Religious Language in Climate Discourse [1.1707176242280342]
This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence.
arXiv Detail & Related papers (2025-10-27T14:54:51Z)
- ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
- Religious Bias Landscape in Language and Text-to-Image Models: Analysis, Detection, and Debiasing Strategies [16.177734242454193]
The widespread adoption of language models highlights the need for critical examinations of their inherent biases. This study systematically investigates religious bias in both language models and text-to-image generation models.
arXiv Detail & Related papers (2025-01-14T21:10:08Z)
- Computational Analysis of Character Development in Holocaust Testimonies [13.639727580099484]
This work presents a computational approach to analyze character development along the narrative timeline. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor's religious trajectory, examining the evolution of their disposition toward religious belief and practice.
arXiv Detail & Related papers (2024-12-22T15:20:53Z)
- Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
- Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models [19.54202714712677]
Unlike gender, which says little about our values, religion as a socio-cultural system prescribes a set of beliefs and values for its followers. Major religions in the US and European countries are represented with more nuance, whereas Eastern religions like Hinduism and Buddhism are strongly stereotyped.
arXiv Detail & Related papers (2024-07-09T14:45:15Z)
- See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages. We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- Religion and Spirituality on Social Media in the Aftermath of the Global Pandemic [59.930429668324294]
We analyse the sudden change in religious activities in two ways: by creating and delivering a questionnaire, and by analysing Twitter data. We also examine the temporal variations in this process over a three-month period: July-September 2020.
arXiv Detail & Related papers (2022-12-11T18:41:02Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)