Auditing Meta and TikTok Research API Data Access under Article 40(12) of the Digital Services Act
- URL: http://arxiv.org/abs/2601.12390v1
- Date: Sun, 18 Jan 2026 12:59:11 GMT
- Title: Auditing Meta and TikTok Research API Data Access under Article 40(12) of the Digital Services Act
- Authors: Luka Bekavac, Simon Mayer
- Abstract summary: This paper presents a systematic audit of research access modalities by comparing data obtained through platform Research APIs with data collected about the same platforms' user-visible public information environment (PIE). Our findings show systematic data loss through three classes of platform-imposed mechanisms: scope narrowing, metadata stripping, and operational restrictions. We conclude that, in their current form, the Meta and TikTok Research APIs fall short of supporting meaningful, independent auditing of systemic risks as envisioned under the Digital Services Act (DSA).
- Score: 8.348593305367523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Article 40(12) of the Digital Services Act (DSA) requires Very Large Online Platforms (VLOPs) to provide vetted researchers with access to publicly accessible data. While prior work has identified shortcomings of platform-provided data access mechanisms, existing research has not quantitatively assessed data quality and completeness in Research APIs across platforms, nor systematically mapped how current access provisions fall short. This paper presents a systematic audit of research access modalities by comparing data obtained through platform Research APIs with data collected about the same platforms' user-visible public information environment (PIE). Focusing on two major platform APIs, the TikTok Research API and the Meta Content Library, we reconstruct full information feeds for two controlled sockpuppet accounts during two election periods and benchmark these against the data retrievable for the same posts through the corresponding Research APIs. Our findings show systematic data loss through three classes of platform-imposed mechanisms: scope narrowing, metadata stripping, and operational restrictions. Together, these mechanisms implement overlapping filters that exclude large portions of the platform PIE (up to approximately 50 percent), strip essential contextual metadata (up to approximately 83 percent), and impose severe technical constraints for researchers (down to approximately 1000 requests per day). Viewed through a data quality lens, these filters primarily undermine completeness, resulting in a structurally biased representation of platform activity. We conclude that, in their current form, the Meta and TikTok Research APIs fall short of supporting meaningful, independent auditing of systemic risks as envisioned under the DSA.
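The comparison described in the abstract, benchmarking what a sockpuppet account saw in its feed against what the Research API returns for the same posts, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function `audit_api_coverage`, the `AuditResult` fields, and the toy post records are all hypothetical, and the two ratios are only one plausible way to quantify scope narrowing (missing posts) and metadata stripping (missing fields).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AuditResult:
    coverage: float         # share of observed posts that the API returned
    field_retention: float  # share of observed metadata fields the API kept
    missing_ids: set[str]   # observed posts the API did not return


def audit_api_coverage(observed: dict[str, dict], api: dict[str, dict]) -> AuditResult:
    """Compare feed observations with API records, both keyed by post ID.

    Hypothetical helper: `observed` holds metadata seen in the user-visible
    feed, `api` holds what the Research API returned for the same query.
    """
    if not observed:
        raise ValueError("observed feed is empty")

    matched = observed.keys() & api.keys()
    missing = set(observed) - matched

    seen_fields = 0
    kept_fields = 0
    for post_id in matched:
        for name, value in observed[post_id].items():
            if value is None:
                continue  # field was not visible in the feed either
            seen_fields += 1
            if api[post_id].get(name) is not None:
                kept_fields += 1

    retention = kept_fields / seen_fields if seen_fields else 0.0
    return AuditResult(len(matched) / len(observed), retention, missing)


# Toy data (invented for illustration, not taken from the paper).
observed = {
    "p1": {"author": "a", "caption": "x", "views": 10},
    "p2": {"author": "b", "caption": "y", "views": 5},
    "p3": {"author": "c", "caption": "z", "views": 7},
    "p4": {"author": "d", "caption": "w", "views": 2},
}
api = {
    "p1": {"author": "a", "caption": "x"},  # view count stripped
    "p2": {"author": "b"},                  # caption and view count stripped
}

result = audit_api_coverage(observed, api)
print(result.coverage, result.field_retention, sorted(result.missing_ids))
# prints: 0.5 0.5 ['p3', 'p4']
```

Computing field retention only over matched posts keeps the two effects separate: a post that the API omits entirely lowers coverage but does not also count as stripped metadata.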
Related papers
- OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.
ODA establishes a comprehensive ecosystem comprising four key pillars, including (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
- Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models [0.6958509696068848]
This paper introduces DSCHECKER, an LLM-based approach for detecting and fixing API misuses of data science libraries.
We identify two key pieces of information, API directives and data information, that may be beneficial for API misuse detection and fixing.
We find that the DSCHECKER agent achieves a detection F1-score of 48.65 percent and fixes 39.47 percent of the misuses.
arXiv Detail & Related papers (2025-09-29T18:30:02Z)
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus.
This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
- TikTok's Research API: Problems Without Explanations [2.06242362470764]
TikTok augmented its Research API access within Europe in July 2023.
Despite this expansion, notable limitations and inconsistencies persist within the data provided.
The API data is incomplete, making it unreliable when working with data donations.
arXiv Detail & Related papers (2025-06-11T13:50:06Z)
- Post-Post-API Age: Studying Digital Platforms in Scant Data Access Times [5.997153455641738]
The "post-API age" has sparked optimism about increased platform transparency and renewed opportunities for comprehensive research on digital platforms.
However, it remains unclear whether platforms provide adequate data access in practice.
Our findings reveal significant challenges in accessing social media data.
These challenges have exacerbated existing institutional, regional, and financial inequities in data access.
arXiv Detail & Related papers (2025-05-15T00:47:06Z)
- The Great Data Standoff: Researchers vs. Platforms Under the Digital Services Act [9.275892768167122]
We focus on the 2024 Romanian presidential election interference incident.
This is the first event of its kind to trigger systemic risk investigations by the European Commission.
By analysing this incident, we can comprehend election-related systemic risk to explore practical research tasks.
arXiv Detail & Related papers (2025-05-02T09:00:19Z)
- From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories [3.323388021979584]
Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems.
This review systematically analyzes methods from traditional blacklisting to advanced deep learning approaches.
Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities.
arXiv Detail & Related papers (2025-04-23T06:23:18Z)
- A Comprehensive Survey on Imbalanced Data Learning [56.65067795190842]
Imbalanced data is prevalent in various types of raw data and hinders the performance of machine learning.
This survey systematically analyzes various real-world data formats.
It groups existing research on different data formats into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
arXiv Detail & Related papers (2025-02-13T04:53:17Z)
- Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
- Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest [70.02478301291264]
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse.
Prior studies have used black-box methods to show that these algorithms can lead to biased or discriminatory outcomes.
We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation.
arXiv Detail & Related papers (2022-07-18T17:32:35Z)
- Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation has begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.