Related papers: Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity

URL: http://arxiv.org/abs/2107.05112v1
Date: Sun, 11 Jul 2021 18:57:03 GMT
Title: Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity
Authors: Md Omar Faruk Rokon, Pei Yan, Risul Islam, Michalis Faloutsos
Abstract summary: Repo2Vec is a comprehensive embedding approach to represent a repository as a distributed vector. We evaluate our method with two real datasets from GitHub for a combined 1013 repositories.
Score: 2.095199622772379
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How can we identify similar repositories and clusters among a large online archive, such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to determine the right representation for the diverse repository features in a way that: (a) it captures all aspects of the available information, and (b) it is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach to represent a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent and combine these information types into a single embedding. We evaluate our method with two real datasets from GitHub for a combined 1013 repositories. First, we show that our method outperforms previous methods in terms of precision (93% vs 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. Second, we show how Repo2Vec provides a solid basis for: (a) distinguishing between malware and benign repositories, and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision and 96% recall in distinguishing malware and benign repositories. Overall, our work is a fundamental building block for enabling many repository analysis functions such as repository categorization by target platform or intention, detecting code-reuse and clones, and identifying lineage and evolution.
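
Illustrative sketch (not from the paper): the abstract says that three per-source embeddings (metadata, structure, source code) are combined into a single repository vector, but it does not state the combination rule or the similarity measure. The minimal Python example below assumes plain concatenation of L2-normalized vectors and cosine similarity. The function names and the toy vectors are hypothetical and are not Repo2Vec's actual implementation or data.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(meta_vec, struct_vec, source_vec):
    """Concatenate the three normalized vectors (assumed combination rule)."""
    combined = []
    for part in (meta_vec, struct_vec, source_vec):
        combined.extend(l2_normalize(part))
    return combined


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration only; real embeddings would be learned.
repo_a = combine_embeddings([1.0, 0.0], [0.5, 0.5], [0.2, 0.8, 0.0])
repo_b = combine_embeddings([0.9, 0.1], [0.4, 0.6], [0.1, 0.9, 0.0])
repo_c = combine_embeddings([0.0, 1.0], [1.0, 0.0], [0.0, 0.0, 1.0])

sim_ab = cosine_similarity(repo_a, repo_b)  # near-identical toy repositories
sim_ac = cosine_similarity(repo_a, repo_c)  # dissimilar toy repositories
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

Normalizing each part before concatenation keeps any single information source from dominating the similarity merely because its vector has a larger magnitude. That is a design choice of this sketch, not a claim about the paper.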

Related papers

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing [86.26405009039868]
BlobCtrl is a framework that unifies element-level generation and editing using a probabilistic blob-based representation. Our approach effectively decouples and represents spatial location, semantic content, and identity information. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency.
arXiv Detail & Related papers (2025-03-17T17:58:05Z)
DependEval: Benchmarking LLMs for Repository Dependency Understanding [16.19185341217556]
While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. We introduce DependEval, a hierarchical benchmark designed to evaluate repository dependency understanding. The benchmark is based on 15,576 repositories collected from real-world websites.
arXiv Detail & Related papers (2025-03-09T16:45:22Z)
Repository-level Code Search with Neural Retrieval Methods [25.222964965449286]
We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. Experiments on a new dataset created from 7 popular open-source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline.
arXiv Detail & Related papers (2025-02-10T21:59:01Z)
SiReRAG: Indexing Similar and Related Information for Multihop Reasoning [96.60045548116584]
SiReRAG is a novel RAG indexing approach that explicitly considers both similar and related information. SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets.
arXiv Detail & Related papers (2024-12-09T04:56:43Z)
How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE). We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories. To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories. We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
RepoFusion: Training Code Models to Understand Your Repository [12.621282610983592]
Large Language Models (LLMs) in coding assistants like GitHub Copilot struggle to understand the context present in the repository. Recent work has shown the promise of using context from the repository during inference. We propose RepoFusion, a framework to train models to incorporate relevant repository context.
arXiv Detail & Related papers (2023-06-19T15:05:31Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
Topical: Learning Repository Embeddings from Source Code using Attention [3.110769442802435]
This paper presents Topical, a novel deep neural network for repository-level embeddings. The attention mechanism generates repository-level representations from source code, full dependency graphs, and script-level textual data.
arXiv Detail & Related papers (2022-08-19T18:13:27Z)
Learning Implicit Feature Alignment Function for Semantic Segmentation [51.36809814890326]
The Implicit Feature Alignment function (IFA) is inspired by the rapidly expanding topic of implicit neural representations. We show that IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions. Our method can be combined with improvement on various architectures, and it achieves a state-of-the-art accuracy trade-off on common benchmarks.
arXiv Detail & Related papers (2022-06-17T09:40:14Z)
Deep Class Incremental Learning from Decentralized Data [103.2386956343121]
We focus on a new and challenging decentralized machine learning paradigm in which there are continuous inflows of data to be addressed. We introduce a paradigm to create a basic decentralized counterpart of typical (centralized) class-incremental learning approaches. We propose a Decentralized Composite knowledge Incremental Distillation framework (DCID) to transfer knowledge from historical models and multiple local sites to the general model continually.
arXiv Detail & Related papers (2022-03-11T15:09:33Z)
Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection. Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
Weakly Supervised Instance Attention for Multisource Fine-Grained Object Recognition with an Application to Tree Species Classification [9.668407688201361]
We propose a multisource method to classify relatively small objects. The proposed method uses a single-source deep instance attention model with parallel branches for joint localization and classification of objects. We show that all levels of fusion provide higher accuracies compared to the state-of-the-art, with the best performing method of feature-level fusion resulting in 53% accuracy for the recognition of 40 different types of trees.
arXiv Detail & Related papers (2021-05-23T17:51:14Z)
Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels. In this work, we argue that learning only an objectness function is a weak form of knowledge transfer. Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of a pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.