The Great Data Standoff: Researchers vs. Platforms Under the Digital Services Act
- URL: http://arxiv.org/abs/2505.01122v1
- Date: Fri, 02 May 2025 09:00:19 GMT
- Title: The Great Data Standoff: Researchers vs. Platforms Under the Digital Services Act
- Authors: Catalina Goanta, Savvas Zannettou, Rishabh Kaushal, Jacob van de Kerkhof, Thales Bertaglia, Taylor Annabell, Haoyang Gui, Gerasimos Spanakis, Adriana Iamnitchi
- Abstract summary: We focus on the 2024 Romanian presidential election interference incident. This is the first event of its kind to trigger systemic risk investigations by the European Commission. By analysing this incident, we can characterise election-related systemic risk and explore practical research tasks.
- Score: 9.275892768167122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To facilitate accountability and transparency, the Digital Services Act (DSA) sets up a process through which Very Large Online Platforms (VLOPs) need to grant vetted researchers access to their internal data (Article 40(4)). Operationalising such access is challenging for at least two reasons. First, data access is only available for research on systemic risks affecting European citizens, a concept with high levels of legal uncertainty. Second, data access suffers from an inherent standoff problem: researchers need to request specific data but are not in a position to know all internal data processed by VLOPs, who, in turn, expect data specificity before granting access. In light of these limitations, data access under the DSA remains a mystery. To contribute to the discussion of how Article 40 can be interpreted and applied, we provide a concrete illustration of what data access can look like in a real-world systemic risk case study. We focus on the 2024 Romanian presidential election interference incident, the first event of its kind to trigger systemic risk investigations by the European Commission. During the elections, one candidate is said to have benefited from TikTok algorithmic amplification through a complex dis- and misinformation campaign. Analysing this incident allows us to characterise election-related systemic risk, explore practical research tasks, and compare the data such research requires with the TikTok data actually available. In particular, we make two contributions: (i) we combine insights from law, computer science and platform governance to shed light on the complexities of studying systemic risks in the context of election interference, focusing on two relevant factors: platform manipulation and hidden advertising; and (ii) we provide practical insights into various categories of available data for the study of TikTok, based on platform documentation, data donations and the Research API.
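The paper's second contribution contrasts the data such a study needs with what TikTok's Research API actually exposes. As a rough, hypothetical sketch of that access route (the endpoint, field names, and query syntax follow TikTok's public Research API documentation but should be verified against the current docs; the access token and date window are placeholders, not values from the paper), a video query scoped to Romania during the election period might look like:

```python
import requests

# Placeholder: a real token is obtained via TikTok's client-credentials flow
# after Research API approval.
ACCESS_TOKEN = "<your-research-api-token>"

# Research API video query endpoint; requested fields are passed as a
# comma-separated query parameter.
url = "https://open.tiktokapis.com/v2/research/video/query/"
params = {"fields": "id,video_description,create_time,region_code,view_count"}
headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Content-Type": "application/json",
}

# Hypothetical filter: videos posted from Romania in a window around the
# first election round (the API limits each query to a 30-day range).
body = {
    "query": {
        "and": [
            {"operation": "IN", "field_name": "region_code", "field_values": ["RO"]},
        ]
    },
    "start_date": "20241101",
    "end_date": "20241128",
    "max_count": 100,
}

resp = requests.post(url, params=params, headers=headers, json=body)
resp.raise_for_status()

# The response nests matched videos under data.videos, with cursor/has_more
# for pagination.
for video in resp.json().get("data", {}).get("videos", []):
    print(video["id"], video.get("view_count"), video.get("video_description", "")[:80])
```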
Related papers
- Post-Post-API Age: Studying Digital Platforms in Scant Data Access Times [5.997153455641738]
The "post-API age" has sparked optimism about increased platform transparency and renewed opportunities for comprehensive research on digital platforms.<n>However, it remains unclear whether platforms provide adequate data access in practice.<n>Our findings reveal significant challenges in accessing social media data.<n>These challenges have exacerbated existing institutional, regional, and financial inequities in data access.
arXiv Detail & Related papers (2025-05-15T00:47:06Z)
- Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset [20.315416393247247]
We introduce LeCaRDv2, a large-scale Legal Case Retrieval dataset (version 2).
It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents.
We enrich the existing relevance criteria by considering three key aspects: characterization, penalty, procedure.
It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law.
arXiv Detail & Related papers (2023-10-26T17:32:55Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Should I disclose my dataset? Caveats between reproducibility and individual data rights [5.816090284071069]
Digital availability of court documents increases possibilities for researchers.
However, personal data protection laws impose restrictions on data exposure.
We present legal and ethical considerations on the issue, as well as guidelines for researchers.
arXiv Detail & Related papers (2022-11-01T14:42:11Z)
- PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data [3.2136309934080867]
Open data sets that contain personal information are susceptible to adversarial attacks even when anonymized.
We develop a visual analytic solution that enables data defenders to gain awareness about the disclosure risks in local, joinable data neighborhoods.
We use this problem and domain characterization to develop a set of visual analytic interventions as a defense mechanism.
arXiv Detail & Related papers (2022-08-12T19:57:09Z)
- Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest [70.02478301291264]
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse.
Prior studies have used black-box methods to show that these algorithms can lead to biased or discriminatory outcomes.
We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation.
arXiv Detail & Related papers (2022-07-18T17:32:35Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Amplifying Privacy: Scaling Up Transparency Research Through Delegated Access Requests [0.0]
We present an alternative method to ask participants to delegate their right of access to the researchers.
We discuss the legal grounds for doing this, the advantages it can bring to both researchers and data subjects, and present a procedural and technical design.
We tested our method in a pilot study in the Netherlands, and found that it creates a win-win for both the researchers and the participants.
arXiv Detail & Related papers (2021-06-12T19:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.