Topical: Learning Repository Embeddings from Source Code using Attention
- URL: http://arxiv.org/abs/2208.09495v4
- Date: Sat, 4 Nov 2023 16:56:47 GMT
- Title: Topical: Learning Repository Embeddings from Source Code using Attention
- Authors: Agathe Lherondelle, Varun Babbar, Yash Satsangi, Fran Silavong,
Shaltiel Eloul, Sean Moran
- Abstract summary: This paper presents Topical, a novel deep neural network for repository-level embeddings.
The attention mechanism generates repository-level representations from source code, full dependency graphs, and script-level textual data.
- Score: 3.110769442802435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Topical, a novel deep neural network for
repository-level embeddings. Topical outperforms existing methods, which rely
on natural language documentation or naive aggregation techniques, by using an
attention mechanism that generates repository-level representations from source
code, full dependency graphs, and script-level textual data.
Trained on publicly accessible GitHub repositories, Topical surpasses multiple
baselines in tasks such as repository auto-tagging, highlighting the attention
mechanism's efficacy over traditional aggregation methods. Topical also
demonstrates scalability and efficiency, making it a valuable contribution to
repository-level representation computation. For further research, the
accompanying tools, code, and training dataset are provided at:
https://github.com/jpmorganchase/topical.
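The attention-based aggregation the abstract describes can be sketched minimally as softmax-weighted pooling of per-script embeddings into a single repository vector. This is an illustrative reconstruction, not Topical's actual architecture; the query vector, dimensions, and pooling form are assumptions.

```python
import numpy as np

def attention_pool(script_embeddings, query):
    """Aggregate per-script embeddings into one repository embedding via
    softmax attention (illustrative sketch, not Topical's exact model)."""
    scores = script_embeddings @ query        # one score per script
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ script_embeddings        # weighted sum -> repo embedding

rng = np.random.default_rng(0)
scripts = rng.normal(size=(5, 8))   # embeddings for 5 scripts in a repo (hypothetical)
query = rng.normal(size=8)          # learned query vector (hypothetical)
repo_vec = attention_pool(scripts, query)
```

With a learned query, scripts most relevant to the repository's purpose dominate the pooled vector, unlike naive mean pooling, which weights every script equally.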
Related papers
- Topological Methods in Machine Learning: A Tutorial for Practitioners [4.297070083645049]
Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data structures.
This tutorial provides a comprehensive introduction to two key TML techniques, persistent homology and the Mapper algorithm.
To enhance accessibility, we adopt a data-centric approach, enabling readers to gain hands-on experience applying these techniques to relevant tasks.
arXiv Detail & Related papers (2024-09-04T17:44:52Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present a novel benchmark designed to evaluate repository-level code generation.
We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing [82.96523584351314]
We decouple the task of context retrieval from the other components of the repository-level code editing pipelines.
We conclude that while reasoning helps improve the precision of the gathered context, it still cannot determine whether that context is sufficient.
arXiv Detail & Related papers (2024-06-06T19:44:17Z) - How to Understand Whole Software Repository? [64.19431011897515]
A thorough understanding of the whole repository is on the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z) - Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature [0.0]
This study introduces an automated methodology to bridge the gap by systematically extracting GitHub repository URLs from academic papers indexed in arXiv.
Our approach encompasses querying the arXiv API for relevant papers, cleaning extracted GitHub URLs, fetching comprehensive repository information via the GitHub API, and analyzing repository maturity based on defined metrics such as stars, forks, open issues, and contributors.
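The cleaning and scoring steps of that pipeline can be sketched as follows. The regex, the punctuation stripping, and the maturity weighting are illustrative assumptions, not the paper's actual implementation; a real pipeline would fetch stars, forks, issues, and contributors from the GitHub API.

```python
import re

# Candidate GitHub repository URLs: owner/repo after the domain (assumed pattern)
GITHUB_URL = re.compile(r"https?://github\.com/[\w.\-]+/[\w.\-]+")

def extract_repo_urls(text):
    """Pull GitHub repository URLs out of paper text, dropping trailing
    punctuation (sketch of the URL-cleaning step)."""
    return [u.rstrip(".,;)") for u in GITHUB_URL.findall(text)]

def maturity_score(stars, forks, open_issues, contributors):
    """Toy maturity metric combining the fields the abstract names;
    this weighting is an illustrative assumption, not the paper's."""
    return 2 * stars + forks + contributors - open_issues

abstract = "Code is at https://github.com/jpmorganchase/topical."
urls = extract_repo_urls(abstract)
```

Stripping trailing punctuation matters in practice because URLs at sentence ends otherwise carry the period into the API request.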
arXiv Detail & Related papers (2024-03-20T17:06:51Z) - Enhancing Source Code Representations for Deep Learning with Static
Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z) - Improving Deep Representation Learning via Auxiliary Learnable Target Coding [69.79343510578877]
This paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning.
Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations.
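The two losses named in the summary can be sketched as below. The margin value, distance choice, and the exact form of the correlation consistency term are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull the anchor toward its target code,
    push it away from other codes (margin value is an assumption)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def correlation_consistency(features, codes):
    """Penalize mismatch between the pairwise correlation structure of the
    features and that of their target codes (illustrative formulation)."""
    cf = np.corrcoef(features)
    cc = np.corrcoef(codes)
    return float(np.mean((cf - cc) ** 2))

rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 6))   # 4 samples, 6-dim features (hypothetical)
```

When the feature correlations already match the target-code correlations, the consistency term vanishes, leaving only the discriminative triplet signal.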
arXiv Detail & Related papers (2023-05-30T01:38:54Z) - Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a thorough understanding of the major developments in the field of table detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z) - MORE: A Metric Learning Based Framework for Open-domain Relation
Extraction [25.149590577718996]
Open relation extraction (OpenRE) is the task of extracting relation schemes from open-domain corpora.
We propose a novel learning framework named MORE (Metric learning-based Open Relation Extraction).
arXiv Detail & Related papers (2022-06-01T07:51:20Z) - Capturing Structural Locality in Non-parametric Language Models [85.94669097485992]
We propose a simple yet effective approach for adding locality information into non-parametric language models.
Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy.
arXiv Detail & Related papers (2021-10-06T15:53:38Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.