Repo2Vec: A Comprehensive Embedding Approach for Determining Repository
Similarity
- URL: http://arxiv.org/abs/2107.05112v1
- Date: Sun, 11 Jul 2021 18:57:03 GMT
- Title: Repo2Vec: A Comprehensive Embedding Approach for Determining Repository
Similarity
- Authors: Md Omar Faruk Rokon, Pei Yan, Risul Islam, Michalis Faloutsos
- Abstract summary: Repo2Vec is a comprehensive embedding approach to represent a repository as a distributed vector.
We evaluate our method with two real datasets from GitHub for a combined 1013 repositories.
- Score: 2.095199622772379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we identify similar repositories and clusters among a large online
archive, such as GitHub? Determining repository similarity is an essential
building block in studying the dynamics and the evolution of such software
ecosystems. The key challenge is to determine the right representation for the
diverse repository features in a way that: (a) it captures all aspects of the
available information, and (b) it is readily usable by ML algorithms. We propose
Repo2Vec, a comprehensive embedding approach to represent a repository as a
distributed vector by combining features from three types of information
sources. As our key novelty, we consider three types of information:
(a) metadata, (b) the structure of the repository, and (c) the source code. We
also introduce a series of embedding approaches to represent and combine these
information types into a single embedding. We evaluate our method with two real
datasets from GitHub for a combined 1013 repositories. First, we show that our
method outperforms previous methods in terms of precision (93% vs 78%), with
nearly twice as many Strongly Similar repositories and 30% fewer False
Positives. Second, we show how Repo2Vec provides a solid basis for: (a)
distinguishing between malware and benign repositories, and (b) identifying a
meaningful hierarchical clustering. For example, we achieve 98% precision and
96% recall in distinguishing malware and benign repositories. Overall, our work
is a fundamental building block for enabling many repository analysis functions
such as repository categorization by target platform or intention, detecting
code-reuse and clones, and identifying lineage and evolution.
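The abstract describes combining three per-repository embeddings (metadata, structure, source code) into a single vector and then comparing repositories by similarity. The following is a minimal sketch of that general idea only. It assumes plain concatenation of L2-normalized parts and cosine similarity; the abstract does not specify the combination rule, so these choices, the function names, and the toy vectors are all hypothetical and are not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(meta_vec, struct_vec, source_vec):
    """Hypothetical combination rule: normalize each part, then concatenate.

    Normalizing first keeps any one information type from dominating
    the combined vector purely because of its scale.
    """
    combined = []
    for part in (meta_vec, struct_vec, source_vec):
        combined.extend(l2_normalize(part))
    return combined


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for learned embeddings (made up for illustration).
repo_a = combine_embeddings([1.0, 0.0], [0.5, 0.5], [0.2, 0.8])
repo_b = combine_embeddings([1.0, 0.0], [0.5, 0.5], [0.2, 0.8])
repo_c = combine_embeddings([0.0, 1.0], [0.5, -0.5], [0.8, -0.2])

print(cosine_similarity(repo_a, repo_b))  # identical inputs
print(cosine_similarity(repo_a, repo_c))  # every part is orthogonal
```

In this sketch, a pair of repositories scores high only when the parts agree, and a threshold on the score could separate similar from dissimilar pairs. A real pipeline would replace the toy vectors with embeddings learned from each information source.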
Related papers
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
- RepoFusion: Training Code Models to Understand Your Repository [12.621282610983592]
Large Language Models (LLMs) in coding assistants like GitHub Copilot struggle to understand the context present in the repository.
Recent work has shown the promise of using context from the repository during inference.
We propose RepoFusion, a framework to train models to incorporate relevant repository context.
arXiv Detail & Related papers (2023-06-19T15:05:31Z)
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
- Topical: Learning Repository Embeddings from Source Code using Attention [3.110769442802435]
This paper presents Topical, a novel deep neural network for repository level embeddings.
The attention mechanism generates repository-level representations from source code, full dependency graphs, and script level textual data.
arXiv Detail & Related papers (2022-08-19T18:13:27Z)
- Learning Implicit Feature Alignment Function for Semantic Segmentation [51.36809814890326]
Implicit Feature Alignment function (IFA) is inspired by the rapidly expanding topic of implicit neural representations.
We show that IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions.
Our method can be combined with improvement on various architectures, and it achieves state-of-the-art accuracy trade-off on common benchmarks.
arXiv Detail & Related papers (2022-06-17T09:40:14Z)
- Deep Class Incremental Learning from Decentralized Data [103.2386956343121]
We focus on a new and challenging decentralized machine learning paradigm in which there are continuous inflows of data to be addressed.
We introduce a paradigm to create a basic decentralized counterpart of typical (centralized) class-incremental learning approaches.
We propose a Decentralized Composite knowledge Incremental Distillation framework (DCID) to transfer knowledge from historical models and multiple local sites to the general model continually.
arXiv Detail & Related papers (2022-03-11T15:09:33Z)
- Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
- Weakly Supervised Instance Attention for Multisource Fine-Grained Object
Recognition with an Application to Tree Species Classification [9.668407688201361]
We propose a multisource method to classify relatively small objects.
The proposed method uses a single-source deep instance attention model with parallel branches for joint localization and classification of objects.
We show that all levels of fusion provide higher accuracies compared to the state-of-the-art, with the best performing method of feature-level fusion resulting in 53% accuracy for the recognition of 40 different types of trees.
arXiv Detail & Related papers (2021-05-23T17:51:14Z)
- Pairwise Similarity Knowledge Transfer for Weakly Supervised Object
Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of a pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.