How Sovereign Is Sovereign Compute? A Review of 775 Non-U.S. Data Centers
- URL: http://arxiv.org/abs/2508.00932v1
- Date: Wed, 30 Jul 2025 22:58:42 GMT
- Title: How Sovereign Is Sovereign Compute? A Review of 775 Non-U.S. Data Centers
- Authors: Aris Richardson, Haley Yi, Michelle Nie, Simon Wisdom, Casey Price, Ruben Weijers, Steven Veld, Mauricio Baker,
- Abstract summary: This paper estimates how often data centers could be subject to foreign legal authorities due to the nationality of the data center operators. We find that U.S. companies operate 48% of all non-U.S. data center projects in our dataset when weighted by investment value.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous literature has proposed that the companies operating data centers enforce government regulations on AI companies. Using a new dataset of 775 non-U.S. data center projects, this paper estimates how often data centers could be subject to foreign legal authorities due to the nationality of the data center operators. We find that U.S. companies operate 48% of all non-U.S. data center projects in our dataset when weighted by investment value - a proxy for compute capacity. This is an approximation based on public data and should be interpreted as an initial estimate. For the United States, our findings suggest that data center operators offer a lever for internationally governing AI that complements traditional export controls, since operators can be used to regulate computing resources already deployed in non-U.S. data centers. For other countries, our results show that building data centers locally does not guarantee digital sovereignty if those facilities are run by foreign entities. To support future research, we release our dataset, which documents over 20 variables relating to each data center, including the year it was announced, the investment value, and its operator's national affiliation. The dataset also includes over 1,000 quotes describing these data centers' strategic motivations, operational challenges, and engagement with U.S. and Chinese entities.
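The 48% figure in the abstract is a share weighted by investment value rather than a simple count of projects. The following is a minimal sketch of how such a weighted share could be computed. The function name `weighted_operator_share`, the tuple layout, and the toy rows are all hypothetical and are not taken from the released dataset; the paper's actual procedure may differ.

```python
from collections import defaultdict


def weighted_operator_share(projects):
    """Return each operator nationality's share of total investment value.

    `projects` is an iterable of (operator_nationality, investment_value)
    pairs. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(float)
    for nationality, investment in projects:
        totals[nationality] += investment
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {nat: value / grand_total for nat, value in totals.items()}


# Made-up toy rows (NOT real data): one large U.S.-operated project
# and three smaller projects run by other operators.
toy_projects = [
    ("US", 600.0),
    ("Local", 100.0),
    ("Local", 100.0),
    ("Other", 200.0),
]

shares = weighted_operator_share(toy_projects)
count_share_us = sum(1 for nat, _ in toy_projects if nat == "US") / len(toy_projects)

print(f"US share by investment value: {shares['US']:.0%}")
print(f"US share by project count:    {count_share_us:.0%}")
```

In this toy input the U.S. operator runs one project out of four but accounts for most of the investment, so the weighted share is much larger than the count-based share. That gap is why the choice of weighting matters when reading the headline number.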
Related papers
- What AI Speaks for Your Community: Polling AI Agents for Public Opinion on Data Center Projects [16.822770693792826]
We introduce an AI agent polling framework to assess community opinion on data centers. Our experiments reveal water consumption and utility costs as primary concerns, while tax revenue is a key perceived benefit. Our framework can serve as a scalable screening tool, enabling developers to integrate community sentiment into early-stage planning.
arXiv Detail & Related papers (2025-11-27T02:46:36Z)
- What's the next frontier for Data-centric AI? Data Savvy Agents [71.76058707995398]
We argue that data-savvy capabilities should be a top priority in the design of agentic systems. We propose four key capabilities to realize this vision: proactive data acquisition, sophisticated data processing, interactive test data synthesis, and continual adaptation.
arXiv Detail & Related papers (2025-11-02T17:09:29Z)
- Real-World En Call Center Transcripts Dataset with PII Redaction [0.8077903172320928]
CallCenterEN is a large-scale (91,706 conversations, corresponding to 10,448 audio hours) real-world English call center transcript dataset. This is the largest release to date of open-source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines, and the United States.
arXiv Detail & Related papers (2025-06-30T03:41:02Z)
- Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z)
- Future and AI-Ready Data Strategies: Response to DOC RFI on AI and Open Government Data Assets [6.659894897434807]
The following is a response to the US Department of Commerce's Request for Information (RFI) regarding AI and Open Government Data Assets. We commend the Department for its initiative in seeking public insights on the organization and sharing of data. In our response, we outline best practices and key considerations for AI and the Department of Commerce's Open Government Data Assets.
arXiv Detail & Related papers (2024-07-26T07:31:32Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- DAVED: Data Acquisition via Experimental Design for Data Markets [25.300193837833426]
We propose a federated approach to the data acquisition problem that is inspired by linear experimental design.
Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data.
The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
arXiv Detail & Related papers (2024-03-20T18:05:52Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Compute at Scale: A Broad Investigation into the Data Center Industry [0.8547032097715571]
The global industry is valued at approximately $250B and is expected to double over the next seven years.
There are likely about 500 large (above 10 MW) data centers globally, with the US, Europe, and China constituting the most important markets.
arXiv Detail & Related papers (2023-11-05T13:39:59Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Secure Multiparty Computation for Synthetic Data Generation from Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education.
Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation.
We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z)
- Data Governance in the Age of Large-Scale Data-Driven Language Technology [79.92626780294258]
This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights.
The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
arXiv Detail & Related papers (2022-05-04T00:44:35Z)
- Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics [58.50754318846996]
In this paper, we show that the performances of metrics are sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
arXiv Detail & Related papers (2022-03-29T18:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.