Representation of Developer Expertise in Open Source Software
- URL: http://arxiv.org/abs/2005.10176v3
- Date: Tue, 2 Feb 2021 11:43:45 GMT
- Title: Representation of Developer Expertise in Open Source Software
- Authors: Tapajit Dey, Andrey Karnauch, Audris Mockus
- Abstract summary: We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers.
We then employ Doc2Vec embeddings for vector representations of APIs, developers, and projects.
We evaluate if these embeddings reflect the postulated topology of the Skill Space.
- Score: 12.583969739954526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Accurate representation of developer expertise has always been an
important research problem. While a number of studies proposed novel methods of
representing expertise within individual projects, these methods are difficult
to apply at an ecosystem level. However, with the focus of software development
shifting from monolithic to modular, a method of representing developers'
expertise in the context of the entire OSS development becomes necessary when,
for example, a project tries to find new maintainers and looks for developers
with relevant skills. Aim: We aim to address this knowledge gap by proposing
and constructing the Skill Space where each API, developer, and project is
represented and postulate how the topology of this space should reflect what
developers know (and projects need). Method: We use the World of Code
infrastructure to extract the complete set of APIs in the files changed by open
source developers and, based on that data, employ Doc2Vec embeddings for vector
representations of APIs, developers, and projects. We then evaluate if these
embeddings reflect the postulated topology of the Skill Space by predicting
what new APIs/projects developers use/join, and whether or not their pull
requests get accepted. We also check how the developers' representations in the
Skill Space align with their self-reported API expertise. Result: Our results
suggest that the proposed embeddings in the Skill Space appear to satisfy the
postulated topology and we hope that such representations may aid in the
construction of signals that increase trust (and efficiency) of open source
ecosystems at large and may aid investigations of other phenomena related to
developer proficiency and learning.
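The idea of placing developers and projects in one shared space and comparing them by proximity can be illustrated with a minimal sketch. This is not the paper's method: it swaps Doc2Vec for plain bag-of-API count vectors with cosine similarity, and every name and data value below (`developer_apis`, `project_apis`, `rank_developers`) is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: APIs found in the files changed by each developer,
# and APIs used by a project. Real data would come from commit histories.
developer_apis = {
    "dev_a": ["numpy", "pandas", "numpy", "sklearn"],
    "dev_b": ["react", "express", "react"],
}
project_apis = {
    "proj_ml": ["numpy", "sklearn", "pandas", "scipy"],
}


def to_vector(apis):
    """Count how often each API occurs (a sparse bag-of-APIs vector)."""
    return Counter(apis)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_developers(project, developers):
    """Rank developers by similarity to a project's API vector, highest first."""
    target = to_vector(project)
    scores = {
        name: cosine(to_vector(apis), target)
        for name, apis in developers.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_developers(project_apis["proj_ml"], developer_apis)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A learned embedding such as Doc2Vec would replace `to_vector`, so that related but non-identical APIs could also end up close together, which raw counts cannot capture.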
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- Knowledge Islands: Visualizing Developers Knowledge Concentration [0.0]
Knowledge Islands is a tool that visualizes the concentration of knowledge in a software repository using a state-of-the-art knowledge model.
It enables practitioners to analyze GitHub projects, determine where knowledge is concentrated, and implement measures to maintain project health.
arXiv Detail & Related papers (2024-08-16T13:32:49Z)
- The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources [100.23208165760114]
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications.
To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet.
arXiv Detail & Related papers (2024-06-24T15:55:49Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
- Enhancing API Documentation through BERTopic Modeling and Summarization [0.0]
This paper focuses on the complexities of interpreting Application Programming Interface (API) documentation.
Official API documentation serves as a primary source of information for developers, but it is often extensive and lacks user-friendliness.
Our novel approach employs the strengths of BERTopic for topic modeling and Natural Language Processing (NLP) to automatically generate summaries of API documentation.
arXiv Detail & Related papers (2023-08-17T15:57:12Z)
- Code Recommendation for Open Source Software Developers [32.181023933552694]
CODER is a novel graph-based code recommendation framework for open source software developers.
Our framework achieves superior performance under various experimental settings, including intra-project, cross-project, and cold-start recommendation.
arXiv Detail & Related papers (2022-10-15T16:40:36Z)
- Dev2vec: Representing Domain Expertise of Developers in an Embedding Space [10.321562340915406]
We employ doc2vec to represent the domain expertise of developers as embedding vectors.
These vectors are derived from different sources that contain evidence of developers' expertise.
Our results indicate that encoding the expertise of developers in an embedding vector outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-07-11T18:56:49Z)
- Enabling collaborative data science development with the Ballet framework [9.424574945499844]
We present a novel conceptual framework and ML programming model to address challenges to scaling data science collaborations.
We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science.
arXiv Detail & Related papers (2020-12-14T18:51:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.