Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub
- URL: http://arxiv.org/abs/2403.04419v1
- Date: Thu, 7 Mar 2024 11:36:09 GMT
- Title: Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub
- Authors: Md Rayhanul Masud (University of California, Riverside), Michalis
Faloutsos (University of California, Riverside)
- Abstract summary: We use ChatGPT to understand and annotate the content published in software repositories.
We carry out a systematic study on a collection of 35.2K GitHub repositories claimed to be created for educational purposes only.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are malicious repositories hiding under the educational label in GitHub?
Recent studies have identified collections of GitHub repositories hosting
malware source code, with notable collaboration among the developers. GitHub
therefore deserves close attention, since its open-source nature provides easy
access to malicious software code and artifacts. Here we leverage the
capabilities of ChatGPT in a qualitative study to annotate an educational
GitHub repository based on the maliciousness of its metadata contents. Our
contribution is twofold. First, we demonstrate the use of ChatGPT to understand
and annotate the content published in software repositories. Second, we provide
evidence of hidden risk in educational repositories, which creates
opportunities for potential threats and malicious intent. We carry out a
systematic study on a collection of 35.2K
GitHub repositories claimed to be created for educational purposes only. First,
our study finds an increasing trend in the number of such repositories
published every year. Second, 9,294 of them are labeled by ChatGPT as
malicious, and further categorization of the malicious ones identifies 14
different malware families, including DDoS tools, keyloggers, and ransomware.
Overall, this exploratory study is a wake-up call for the community to better
understand and analyze software platforms.
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z)
- Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits.
To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible.
In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
- LEGION: Harnessing Pre-trained Language Models for GitHub Topic Recommendations with Distribution-Balance Loss [3.946772434700026]
Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data.
This paper proposes Legion, a novel approach that leverages Pre-trained Language Models (PTMs) for recommending topics for GitHub repositories.
Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion can improve vanilla PTMs by up to 26% on recommending GitHub topics.
arXiv Detail & Related papers (2024-03-09T10:49:31Z)
- How do Software Engineering Researchers Use GitHub? An Empirical Study of Artifacts & Impact [0.2209921757303168]
We ask whether and how authors engage in social coding related to their research.
We examine ten thousand papers in top SE research venues, hand-annotating their GitHub links and studying 309 paper-related repositories.
We find a wide distribution in popularity and impact, some strongly correlated with publication venue.
arXiv Detail & Related papers (2023-10-02T18:56:33Z)
- VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model [13.96251273677855]
VulLibGen is a method to directly generate affected packages.
It has an average accuracy of 0.806 for identifying vulnerable packages.
We have submitted 60 <vulnerability, affected package> pairs to GitHub Advisory.
arXiv Detail & Related papers (2023-08-09T02:02:46Z)
- On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potentially vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z)
- Multifaceted Hierarchical Report Identification for Non-Functional Bugs in Deep Learning Frameworks [5.255197438986675]
We propose MHNurf - an end-to-end tool for automatically identifying non-functional bug related reports in Deep Learning (DL) frameworks.
The core of MHNurf is a Multifaceted Hierarchical Attention Network (MHAN) that tackles three unaddressed challenges.
MHNurf works the best with a combination of content, comment, and code, which considerably outperforms the classic HAN where only the content is used.
arXiv Detail & Related papers (2022-10-04T18:49:37Z)
- GitRank: A Framework to Rank GitHub Repositories [0.0]
Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems.
In this hackathon, we utilize known code quality measures and the GrimoireLab toolkit to implement a framework, named GitRank, to rank open-source repositories on three different criteria.
arXiv Detail & Related papers (2022-05-04T23:42:30Z)