taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
- URL: http://arxiv.org/abs/2506.05388v1
- Date: Tue, 03 Jun 2025 16:24:33 GMT
- Authors: Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
- Abstract summary: We present taz2024full, the largest publicly available corpus of German newspaper articles to date, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
Related papers
- Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language [21.87606488958834]
We present five German datasets for gender bias evaluation in large language models (LLMs). The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLMs, reveal unique challenges associated with gender bias in German.
arXiv Detail & Related papers (2025-07-22T13:09:41Z)
- Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia [49.80565462746646]
We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level.
We evaluate InfoGap by analyzing LGBT people's portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias.
arXiv Detail & Related papers (2024-10-05T20:40:49Z)
- The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification [57.06913662622832]
Gender-fair language fosters inclusion by addressing all genders or using neutral forms.
Gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns.
While we offer initial insights on the effect on German text classification, the findings likely apply to other languages.
arXiv Detail & Related papers (2024-09-26T15:08:17Z)
- Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human-authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z)
- Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora [9.959039325564744]
Gender bias in text corpora can lead to the perpetuation and amplification of societal inequalities.
Existing methods to measure gender representation bias in text corpora have mainly been proposed for English.
This paper introduces a novel methodology to quantitatively measure gender representation bias in Spanish corpora.
arXiv Detail & Related papers (2024-06-19T16:30:58Z)
- White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content. We introduce the Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs. Using LABE, we unveil language agency social biases in three recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
- GATE X-E: A Challenge Set for Gender-Fair Translations from Weakly-Gendered Languages [0.0]
We introduce GATE X-E, an extension to the GATE corpus, that consists of human translations from Turkish, Hungarian, Finnish, and Persian into English.
The dataset features natural sentences with a wide range of sentence lengths and domains, challenging translation rewriters on various linguistic phenomena.
We present a translation gender rewriting solution built with GPT-4 and use GATE X-E to evaluate it.
arXiv Detail & Related papers (2024-02-22T04:36:14Z)
- Less than one percent of words would be affected by gender-inclusive language in German press texts [43.16629507708997]
We show that, on average, less than 1% of all tokens would be affected by gender-inclusive language.
This small proportion calls into question whether gender-inclusive German presents a substantial barrier to understanding and learning the language.
arXiv Detail & Related papers (2024-02-06T10:32:34Z)
- Evaluating Gender Bias in the Translation of Gender-Neutral Languages into English [0.0]
We introduce GATE X-E, an extension to the GATE corpus, that consists of human translations from Turkish, Hungarian, Finnish, and Persian into English.
The dataset features natural sentences with a wide range of sentence lengths and domains, challenging translation rewriters on various linguistic phenomena.
We present an English gender rewriting solution built on GPT-3.5 Turbo and use GATE X-E to evaluate it.
arXiv Detail & Related papers (2023-11-15T10:25:14Z)
- Bias at a Second Glance: A Deep Dive into Bias for German Educational Peer-Review Data Modeling [10.080007569933331]
We analyze bias across text and through multiple architectures on a corpus of 9,165 German peer reviews collected over five years.
Our collected corpus does not reveal many biases in the co-occurrence analysis or in the GloVe embeddings.
Pre-trained German language models find substantial conceptual, racial, and gender bias.
arXiv Detail & Related papers (2022-09-21T13:08:16Z)
- Quantifying Gender Bias Towards Politicians in Cross-Lingual Language Models [104.41668491794974]
We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender.
We find that while some words, such as "dead" and "designated", are associated with both male and female politicians, a few specific words, such as "beautiful" and "divorced", are predominantly associated with female politicians.
arXiv Detail & Related papers (2021-04-15T15:03:26Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- Multi-Dimensional Gender Bias Classification [67.65551687580552]
Machine learning models can inadvertently learn socially undesirable patterns when trained on gender-biased text.
We propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions.
Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information.
arXiv Detail & Related papers (2020-05-01T21:23:20Z)
- A Framework for the Computational Linguistic Analysis of Dehumanization [52.735780962665814]
We analyze discussions of LGBTQ people in the New York Times from 1986 to 2015.
We find increasingly humanizing descriptions of LGBTQ people over time.
The ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.
arXiv Detail & Related papers (2020-03-06T03:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.