Beneath the Mask: Can Contribution Data Unveil Malicious Personas in Open-Source Projects?
- URL: http://arxiv.org/abs/2508.13453v1
- Date: Tue, 19 Aug 2025 02:17:48 GMT
- Title: Beneath the Mask: Can Contribution Data Unveil Malicious Personas in Open-Source Projects?
- Authors: Ruby Nealon,
- Abstract summary: This paper demonstrates how Open Source Intelligence (OSINT) data gathered from GitHub contributions, analyzed using graph databases and graph theory, can efficiently identify anomalous behaviors exhibited by the "JiaT75" personas across other open-source projects.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In February 2024, after building trust over two years with project maintainers by making a significant volume of legitimate contributions, GitHub user "JiaT75" self-merged a version of the XZ Utils project containing a highly sophisticated, well-disguised backdoor targeting sshd processes running on systems with the backdoored package installed. A month later, this package began to be distributed with popular Linux distributions until a Microsoft employee discovered the backdoor while investigating how a recent system upgrade impacted the performance of SSH authentication. Despite its potential global impact, no tooling exists for monitoring and identifying anomalous behavior by personas contributing to other open-source projects. This paper demonstrates how Open Source Intelligence (OSINT) data gathered from GitHub contributions, analyzed using graph databases and graph theory, can efficiently identify anomalous behaviors exhibited by the "JiaT75" persona across other open-source projects.
Related papers
- Analyzing the Availability of E-Mail Addresses for PyPI Libraries [89.21869606965578]
81.6% of libraries include at least one valid e-mail address, with PyPI serving as the primary source.<n>We identify over 698,000 invalid entries, primarily due to missing fields.
arXiv Detail & Related papers (2026-01-20T14:54:58Z) - Launch-Day Diffusion: Tracking Hacker News Impact on GitHub Stars for AI Tools [0.0]
Social news platforms have become key launch outlets for open-source projects, especially Hacker News.<n>This paper presents a reproducible demonstration system that tracks how HN exposure translates into GitHub star growth for AI and LLM tools.
arXiv Detail & Related papers (2025-11-06T15:23:50Z) - Trace: Securing Smart Contract Repository Against Access Control Vulnerability [58.02691083789239]
GitHub hosts numerous smart contract repositories containing source code, documentation, and configuration files.<n>Third-party developers often reference, reuse, or fork code from these repositories during custom development.<n>Existing tools for detecting smart contract vulnerabilities are limited in their ability to handle complex repositories.
arXiv Detail & Related papers (2025-10-22T05:18:28Z) - Eradicating the Unseen: Detecting, Exploiting, and Remediating a Path Traversal Vulnerability across GitHub [1.2124551005857036]
Vulnerabilities in open-source software can cause cascading effects in the modern digital ecosystem.<n>We identified 1,756 vulnerable open-source projects, some of which are very influential.<n>We have responsibly disclosed the vulnerability to the maintainers, and 14% of the reported vulnerabilities have been remediated.
arXiv Detail & Related papers (2025-05-26T16:29:21Z) - Wolves in the Repository: A Software Engineering Analysis of the XZ Utils Supply Chain Attack [0.8517406772939294]
The digital economy runs on Open Source Software (OSS), with an estimated 90% of modern applications containing open-source components.<n>This paper examines a sophisticated attack on the XZUtils project (-2024-3094), where attackers exploited not just code, but the entire open-source development process.<n>Our analysis reveals a new breed of supply chain attack that manipulates software engineering practices themselves.
arXiv Detail & Related papers (2025-04-24T12:06:11Z) - Discovery of Timeline and Crowd Reaction of Software Vulnerability Disclosures [47.435076500269545]
Apache Log4J was found to be vulnerable to remote code execution attacks.
More than 35,000 packages were forced to update their Log4J libraries with the latest version.
It is practically reasonable for software developers to update their third-party libraries whenever the software vendors have released a vulnerable-free version.
arXiv Detail & Related papers (2024-11-12T01:55:51Z) - OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z) - RepoAgent: An LLM-Powered Open-Source Framework for Repository-level
Code Documentation Generation [79.83270415843857]
We introduce RepoAgent, a large language model powered open-source framework aimed at proactively generating, maintaining, and updating code documentation.
We have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation.
arXiv Detail & Related papers (2024-02-26T15:39:52Z) - Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [79.31134731122462]
We introduce OpenAct benchmark to evaluate the open-domain task-solving capability, built on human expert consultation and repositories in GitHub.<n>We present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z) - A Comparative Study of Software Secrets Reporting by Secret Detection
Tools [5.9347272469695245]
According to GitGuardian's monitoring of public GitHub repositories, secrets continued accelerating in 2022 by 67% compared to 2021.
We present an evaluation of five open-source and four proprietary tools against a benchmark dataset.
The top three tools based on precision are: GitHub Secret Scanner (75%), Gitleaks (46%), and Commercial X (25%), and based on recall are: Gitleaks (88%), SpectralOps (67%) and TruffleHog (52%)
arXiv Detail & Related papers (2023-07-03T02:32:09Z) - On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z) - BackdoorBox: A Python Toolbox for Backdoor Learning [67.53987387581222]
This Python toolbox implements representative and advanced backdoor attacks and defenses.
It allows researchers and developers to easily implement and compare different methods on benchmark or their local datasets.
arXiv Detail & Related papers (2023-02-01T09:45:42Z) - The OCEAN mailing list data set: Network analysis spanning mailing lists
and code repositories [0.0]
We combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present.
To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer with the social layer.
We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.
arXiv Detail & Related papers (2022-04-01T17:50:15Z) - LAGOON: An Analysis Tool for Open Source Communities [7.3861897382622015]
LAGOON is an open source platform for understanding the ecosystems of Open Source Software (OSS) communities.
LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists and scraping content from websites.
A user interface is provided for visualization and exploration of an OSS project's complete sociotechnical graph.
arXiv Detail & Related papers (2022-01-26T18:52:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.