A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis
- URL: http://arxiv.org/abs/2506.23474v1
- Date: Mon, 30 Jun 2025 02:37:27 GMT
- Title: A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis
- Authors: Zhiwei Lin, Bonan Ruan, Jiahao Liu, Weibo Zhao,
- Abstract summary: We introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP clients. Each artifact is annotated with 20+ normalized attributes capturing its identity, interface configuration, GitHub activity, and metadata. MCPCorpus provides a reproducible snapshot of the real-world MCP ecosystem, enabling studies of adoption trends, ecosystem health, and implementation diversity.
- Score: 8.943261888363622
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The Model Context Protocol (MCP) has recently emerged as a standardized interface for connecting language models with external tools and data. As the ecosystem rapidly expands, the lack of a structured, comprehensive view of existing MCP artifacts presents challenges for research. To bridge this gap, we introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP clients. Each artifact is annotated with 20+ normalized attributes capturing its identity, interface configuration, GitHub activity, and metadata. MCPCorpus provides a reproducible snapshot of the real-world MCP ecosystem, enabling studies of adoption trends, ecosystem health, and implementation diversity. To keep pace with the rapid evolution of the MCP ecosystem, we provide utility tools for automated data synchronization, normalization, and inspection. Furthermore, to support efficient exploration and exploitation, we release a lightweight web-based search interface. MCPCorpus is publicly available at: https://github.com/Snakinya/MCPCorpus.
Related papers
- LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? [50.60770039016318]
We present LiveMCPBench, the first comprehensive benchmark for benchmarking Model Context Protocol (MCP) agents. LiveMCPBench consists of 95 real-world tasks grounded in the MCP ecosystem. Our evaluation covers 10 leading models, with the best-performing model reaching a 78.95% success rate.
arXiv Detail & Related papers (2025-08-03T14:36:42Z)
- Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs [0.0]
We present AutoMCP, a compiler that generates MCP servers from OpenAPI 2.0/3.0 specifications. We evaluate AutoMCP on 50 real-world APIs spanning 5,066 endpoints across over 10 domains.
arXiv Detail & Related papers (2025-07-21T20:20:31Z)
- We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems [28.59170303701817]
We conduct the first large-scale empirical analysis of Model Context Protocol security risks. We examine 2,562 real-world MCP applications spanning 23 functional categories. We propose a detailed taxonomy of MCP resource access, quantify security-relevant API usage, and identify open challenges for building safer MCP ecosystems.
arXiv Detail & Related papers (2025-07-05T03:39:30Z)
- Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers [16.794115541448758]
Anthropic introduced the Model Context Protocol (MCP) to standardize this tool ecosystem in late 2024. Despite its adoption, MCP's AI-driven, non-deterministic control flow introduces new risks to sustainability, security, and maintainability. We evaluate 1,899 open-source MCP servers to assess their health, security, and maintainability.
arXiv Detail & Related papers (2025-06-16T14:26:37Z)
- Mic-hackathon 2024: Hackathon on Machine Learning for Electron and Scanning Probe Microscopy [54.24356756795849]
Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. Deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies.
arXiv Detail & Related papers (2025-06-10T03:54:36Z)
- Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents [57.59830804627066]
We introduce MONDAY, a large-scale dataset of 313K annotated frames from 20K instructional videos capturing real-world mobile OS navigation. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities. We present an automated framework that leverages publicly available video content to create comprehensive task datasets.
arXiv Detail & Related papers (2025-05-19T02:39:03Z)
- Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions [5.1875389249043415]
The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats.
arXiv Detail & Related papers (2025-03-30T01:58:22Z)
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism. Our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark. A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
- Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks.
These resources lack a classification tailored to Software Engineering (SE) needs.
We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
- What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception [52.41695608928129]
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources.
This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view.
We propose a novel framework named CMiMC for intermediate collaboration.
arXiv Detail & Related papers (2024-03-15T07:18:55Z)
- Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems [45.05372822216111]
Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected.
However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISP-DM), often fail due to the disproportionate amount of time needed for understanding and preparing the data.
This contribution intends to present an integrated approach so that data scientists are able to more quickly and reliably gain insights into the CPPS challenges.
arXiv Detail & Related papers (2023-07-21T15:04:00Z)
- MLOps: A Step Forward to Enterprise Machine Learning [0.0]
This research presents a detailed review of MLOps, its benefits, difficulties, evolutions, and important underlying technologies.
The MLOps workflow is explained in detail along with the various tools necessary for both model and data exploration and deployment.
This article also sheds light on the end-to-end production of ML projects using various maturity levels of automated pipelines.
arXiv Detail & Related papers (2023-05-27T20:44:14Z)
- KG-Hub -- Building and Exchanging Biological Knowledge Graphs [0.5369297590461578]
KG-Hub is a platform that enables standardized construction, exchange, and reuse of knowledge graphs.
Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research.
arXiv Detail & Related papers (2023-01-31T21:29:35Z)