GitHub Proxy Server: A tool for supporting massive data collection on GitHub
- URL: http://arxiv.org/abs/2505.18305v1
- Date: Fri, 23 May 2025 19:00:32 GMT
- Title: GitHub Proxy Server: A tool for supporting massive data collection on GitHub
- Authors: Hudson Silva Borges, Marco Tulio Valente
- Abstract summary: GitHub is the most popular social coding platform and is widely used by developers and organizations around the world to host their open-source projects. The platform has a web API that allows developers to collect information from the public repositories hosted on it. However, collecting massive amounts of data from GitHub can be very challenging due to existing restrictions and abuse detection mechanisms. We present a tool, called GitHub Proxy Server, which abstracts these complexities into a tool that is independent of operating system and programming language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: GitHub is the most popular social coding platform and is widely used by developers and organizations around the world to host their open-source projects. In addition, the platform has a web API that allows developers to collect information from the public repositories hosted on it. However, collecting massive amounts of data from GitHub can be very challenging due to existing restrictions and abuse detection mechanisms. In this work, we present a tool, called GitHub Proxy Server, which abstracts these complexities into a tool that is independent of operating system and programming language. We show that, using the proposed tool, it is possible to improve the performance of GitHub mining tasks without adding complexity.
Related papers
- SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
- SocialED: A Python Library for Social Event Detection [53.928241775629566]
SocialED is a comprehensive, open-source Python library designed to support social event detection (SED) tasks. It provides a unified API with detailed documentation, offering researchers and practitioners a complete solution for event detection in social media. SocialED supports a wide range of preprocessing techniques, such as graph construction and tokenization, and includes standardized interfaces for training models and making predictions.
arXiv Detail & Related papers (2024-12-18T03:37:47Z)
- 4.5 Million (Suspected) Fake Stars in GitHub: A Growing Spiral of Popularity Contests, Scams, and Malware [58.60545935390151]
We present a global, longitudinal measurement study of fake stars in GitHub. We build StarScout, a scalable tool able to detect anomalous starring behaviors. Our study has implications for platform moderators, open-source practitioners, and supply chain security researchers.
arXiv Detail & Related papers (2024-12-18T03:03:58Z)
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [109.8507367518992]
We introduce OpenHands, a platform for the development of AI agents that interact with the world in similar ways to a human developer. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks.
arXiv Detail & Related papers (2024-07-23T17:50:43Z)
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
- GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [81.44231422624055]
A growing area of research focuses on Large Language Models (LLMs) equipped with external tools capable of performing diverse tasks.
In this paper, we introduce GitAgent, an agent capable of achieving the autonomous tool extension from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
- Open Data on GitHub: Unlocking the Potential of AI [2.3324945410076685]
GitHub is the world's largest platform for collaborative software development, with over 100 million users.
This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research.
arXiv Detail & Related papers (2023-06-09T18:43:26Z)
- TorchRL: A data-driven decision-making library for PyTorch [20.776851077664915]
PyTorch has ascended as a premier machine learning framework, yet it lacks a native and comprehensive library for decision and control tasks.
We propose TorchRL, a generalistic control library for PyTorch that provides well-integrated, yet standalone components.
We provide a detailed description of the building blocks and an extensive overview of the library across domains and tasks.
arXiv Detail & Related papers (2023-06-01T11:45:45Z)
- Testing GitHub projects on custom resources using unprivileged Kubernetes runners [1.137903861863692]
GitHub is a popular repository for hosting software projects.
Native GitHub Actions make it easy for software developers to validate new commits and have confidence that new code does not introduce major bugs.
The freely available test environments are limited to only a few popular setups but can be extended with custom Action Runners.
arXiv Detail & Related papers (2023-05-17T16:31:41Z)
- The GitHub Development Workflow Automation Ecosystems [47.818229204130596]
Large-scale software development has become a highly collaborative endeavour.
This chapter explores the ecosystems of development bots and GitHub Actions.
It provides an extensive survey of the state-of-the-art in this domain.
arXiv Detail & Related papers (2023-05-08T15:24:23Z)
- GitHub Actions: The Impact on the Pull Request Process [7.047566396769727]
This study investigates how projects use GitHub Actions, what the developers discuss about them, and how project activity indicators change after their adoption.
Our results indicate that 1,489 out of 5,000 most popular repositories (almost 30% of our sample) adopt GitHub Actions.
Our findings also suggest that the adoption of GitHub Actions leads to more rejections of pull requests (PRs), more communication in accepted PRs and less communication in rejected PRs.
arXiv Detail & Related papers (2022-06-28T16:24:17Z)
- The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative [0.0]
We develop a novel, extensive sample of public open source project repositories outside of centralized platforms.
Our sample projects tend to have more collaborators, are maintained for longer periods, and tend to be more focused on academic and scientific problems.
arXiv Detail & Related papers (2021-06-29T17:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.