Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub
- URL: http://arxiv.org/abs/2403.04419v1
- Date: Thu, 7 Mar 2024 11:36:09 GMT
- Title: Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub
- Authors: Md Rayhanul Masud (University of California, Riverside), Michalis
Faloutsos (University of California, Riverside)
- Abstract summary: We use ChatGPT to understand and annotate the content published in software repositories.
We carry out a systematic study on a collection of 35.2K GitHub repositories claimed to be created for educational purposes only.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are malicious repositories hiding under the educational label in GitHub?
Recent studies have identified collections of GitHub repositories hosting
malware source code, with notable collaboration among the developers. GitHub
therefore deserves close attention, since its open-source nature provides easy
access to malicious software code and artifacts. Here we leverage the
capabilities of ChatGPT in a qualitative study to annotate an educational
GitHub repository based on the maliciousness of its metadata contents. Our
contribution is twofold. First, we demonstrate the use of ChatGPT to understand
and annotate the content published in software repositories. Second, we provide
evidence of hidden risk in educational repositories, which creates
opportunities for potential threats and malicious intent. We carry out a
systematic study on a collection of 35.2K
GitHub repositories claimed to be created for educational purposes only. First,
our study finds an increasing trend in the number of such repositories
published every year. Second, 9,294 of them are labeled by ChatGPT as
malicious, and further categorization of the malicious ones identifies 14
different malware families, including DDoS tools, keyloggers, and ransomware.
Overall, this exploratory study is a wake-up call for the community to better
understand and analyze software platforms.
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z)
- Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits.
To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible.
In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
- LEGION: Harnessing Pre-trained Language Models for GitHub Topic Recommendations with Distribution-Balance Loss [3.946772434700026]
Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data.
This paper proposes Legion, a novel approach that leverages Pre-trained Language Models (PTMs) for recommending topics for GitHub repositories.
Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion can improve vanilla PTMs by up to 26% on recommending GitHub topics.
arXiv Detail & Related papers (2024-03-09T10:49:31Z)
- How do Software Engineering Researchers Use GitHub? An Empirical Study of Artifacts & Impact [0.2209921757303168]
We ask whether and how authors engage in social coding related to their research.
We examine ten thousand papers in top SE research venues, hand-annotating their GitHub links and studying 309 paper-related repositories.
We find a wide distribution in popularity and impact, some strongly correlated with publication venue.
arXiv Detail & Related papers (2023-10-02T18:56:33Z)
- VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model [13.96251273677855]
VulLibGen is a method to directly generate affected packages.
It has an average accuracy of 0.806 for identifying vulnerable packages.
We have submitted 60 <vulnerability, affected package> pairs to GitHub Advisory.
arXiv Detail & Related papers (2023-08-09T02:02:46Z)
- On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potentially vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z)
- Multifaceted Hierarchical Report Identification for Non-Functional Bugs in Deep Learning Frameworks [5.255197438986675]
We propose MHNurf - an end-to-end tool for automatically identifying non-functional bug related reports in Deep Learning (DL) frameworks.
The core of MHNurf is a Multifaceted Hierarchical Attention Network (MHAN) that tackles three unaddressed challenges.
MHNurf works the best with a combination of content, comment, and code, which considerably outperforms the classic HAN where only the content is used.
arXiv Detail & Related papers (2022-10-04T18:49:37Z)
- GitRank: A Framework to Rank GitHub Repositories [0.0]
Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems.
In this hackathon, we utilize known code quality measures and the GrimoireLab toolkit to implement a framework, named GitRank, to rank open-source repositories on three different criteria.
arXiv Detail & Related papers (2022-05-04T23:42:30Z)