The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale
- URL: http://arxiv.org/abs/2509.25649v1
- Date: Tue, 30 Sep 2025 01:41:49 GMT
- Title: The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale
- Authors: Samar Haider, Amir Tohidi, Jenny S. Wang, Timothy Dörr, David M. Rothschild, Chris Callison-Burch, Duncan J. Watts
- Abstract summary: We introduce a large, ongoing, near real-time dataset and computational framework to study selection and framing bias in news coverage. Our pipeline integrates large language models with scalable, near-real-time news scraping to extract structured annotations. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level.
- Score: 24.955234806377643
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel data set can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
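The abstract describes rolling LLM-assigned annotations up from sentences to articles to publishers. A minimal sketch of that aggregation step, using hypothetical records and field names (the paper's actual schema and scoring scale are not specified here):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sentence-level annotations, as such a pipeline might emit them:
# each record carries a publisher, an article id, and an LLM-assigned lean
# score in [-1, 1] (negative = left-leaning, positive = right-leaning).
annotations = [
    {"publisher": "A", "article": "a1", "lean": -0.4},
    {"publisher": "A", "article": "a1", "lean": -0.2},
    {"publisher": "A", "article": "a2", "lean": 0.0},
    {"publisher": "B", "article": "b1", "lean": 0.6},
    {"publisher": "B", "article": "b1", "lean": 0.2},
]

def aggregate(records):
    """Roll sentence-level lean scores up to article and publisher level."""
    # Group sentence scores by (publisher, article) and average per article.
    by_article = defaultdict(list)
    for r in records:
        by_article[(r["publisher"], r["article"])].append(r["lean"])
    article_lean = {k: mean(v) for k, v in by_article.items()}

    # Average the article-level scores within each publisher.
    by_publisher = defaultdict(list)
    for (pub, _), lean in article_lean.items():
        by_publisher[pub].append(lean)
    publisher_lean = {pub: mean(v) for pub, v in by_publisher.items()}
    return article_lean, publisher_lean

articles, publishers = aggregate(annotations)
print(publishers)
```

Averaging per article first, then per publisher, keeps long articles from dominating a publisher's score; the real framework may weight differently.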
Related papers
- Journalism-Guided Agentic In-Context Learning for News Stance Detection [3.9861857176369564]
Stance detection can help enable viewpoint-aware recommendations and data-driven analyses of media bias. We introduce K-News-Stance, the first Korean dataset for article-level stance detection. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework.
arXiv Detail & Related papers (2025-07-15T07:22:04Z)
- Media Bias Detector: Designing and Implementing a Tool for Real-Time Selection and Framing Bias Analysis in News Coverage [29.438946779179346]
We introduce the Media Bias Detector, a tool for researchers, journalists, and news consumers. By integrating large language models, we provide near real-time granular insights into the topics, tone, political lean, and facts of news articles aggregated to the publisher level.
arXiv Detail & Related papers (2025-02-09T19:54:31Z)
- Mapping the Media Landscape: Predicting Factual Reporting and Political Bias Through Web Interactions [0.7249731529275342]
We propose an extension to a recently presented news media reliability estimation method.
We assess the classification performance of four reinforcement learning strategies on a large news media hyperlink graph.
Our experiments, targeting two challenging bias descriptors, factual reporting and political bias, showed a significant performance improvement at the source media level.
arXiv Detail & Related papers (2024-10-23T08:18:26Z)
- Position: AI/ML Influencers Have a Place in the Academic Process [82.2069685579588]
We investigate the role of social media influencers in enhancing the visibility of machine learning research.
We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023.
Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers.
arXiv Detail & Related papers (2024-01-24T20:05:49Z)
- The Media Bias Taxonomy: A Systematic Literature Review on the Forms and Automated Detection of Media Bias [5.579028648465784]
This article summarizes the research on computational methods to detect media bias by systematically reviewing 3140 research papers published between 2019 and 2022.
We show that media bias detection is a highly active research field, in which transformer-based classification approaches have led to significant improvements in recent years.
arXiv Detail & Related papers (2023-12-26T18:13:52Z)
- Towards Corpus-Scale Discovery of Selection Biases in News Coverage: Comparing What Sources Say About Entities as a Start [65.28355014154549]
This paper investigates the challenges of building scalable NLP systems for discovering patterns of media selection biases directly from news content in massive-scale news corpora.
We show the capabilities of the framework through a case study on NELA-2020, a corpus of 1.8M news articles in English from 519 news sources worldwide.
arXiv Detail & Related papers (2023-04-06T23:36:45Z)
- Computational Assessment of Hyperpartisanship in News Titles [55.92100606666497]
We first adopt a human-guided machine learning framework to develop a new dataset for hyperpartisan news title detection.
Overall, right-leaning media tend to use proportionally more hyperpartisan titles.
We identify three major topics including foreign issues, political systems, and societal issues that are suggestive of hyperpartisanship in news titles.
arXiv Detail & Related papers (2023-01-16T05:56:58Z)
- Unveiling the Hidden Agenda: Biases in News Reporting and Consumption [59.55900146668931]
We build a six-year dataset on the Italian vaccine debate and adopt a Bayesian latent space model to identify narrative and selection biases.
We found a nonlinear relationship between biases and engagement, with higher engagement for extreme positions.
Analysis of news consumption on Twitter reveals common audiences among news outlets with similar ideological positions.
arXiv Detail & Related papers (2023-01-14T18:58:42Z)
- NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z)
- NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias [54.89737992911079]
We propose a new task: generating a neutral summary from multiple news headlines spanning the political spectrum.
One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content, but also politically biased content.
arXiv Detail & Related papers (2022-04-11T07:06:01Z)
- MBIC -- A Media Bias Annotation Dataset Including Annotator Characteristics [0.0]
Media bias, or slanted news coverage, can have a substantial impact on public perception of events.
In this poster, we present a matrix-based methodology to crowdsource such data using a self-developed annotation platform.
We also present MBIC - the first sample of 1,700 statements representing various media bias instances.
arXiv Detail & Related papers (2021-05-20T15:05:17Z)
- "Don't quote me on that": Finding Mixtures of Sources in News Articles [85.92467549469147]
We construct an ontological labeling system for sources based on each source's affiliation and role.
We build a probabilistic model to infer these attributes for named sources and to describe news articles as mixtures of these sources.
arXiv Detail & Related papers (2021-04-19T21:57:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.