An Empirical Study on Data Leakage and Generalizability of Link
Prediction Models for Issues and Commits
- URL: http://arxiv.org/abs/2211.00381v2
- Date: Mon, 24 Apr 2023 11:01:55 GMT
- Authors: Maliheh Izadi, Pooya Rostami Mazrae, Tom Mens, Arie van Deursen
- Abstract summary: LinkFormer preserves and improves the accuracy of existing predictions.
Our findings support that to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data.
- Score: 7.061740334417124
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To enhance documentation and maintenance practices, developers conventionally
establish links between related software artifacts manually. Empirical research
has revealed that developers frequently overlook this practice, resulting in
significant information loss. To address this issue, automatic link recovery
techniques have been proposed. However, these approaches have primarily
focused on improving prediction accuracy on randomly split datasets, with
limited attention given to the impact of data leakage and the
generalizability of the
predictive models. LinkFormer seeks to address these limitations. Our approach
not only preserves and improves the accuracy of existing predictions but also
enhances their alignment with real-world settings and their generalizability.
First, to better utilize contextual information for prediction, we employ the
Transformer architecture and fine-tune multiple pre-trained models on both
textual and metadata information of issues and commits. Next, to gauge the
effect of time on model performance, we employ two splitting policies during
both the training and testing phases: randomly and temporally split datasets.
Finally, in pursuit of a generic model that can demonstrate high performance
across a range of projects, we undertake additional fine-tuning of LinkFormer
within two distinct transfer-learning settings. Our findings confirm that, to
simulate real-world scenarios effectively, researchers must maintain the
temporal flow of data when training models. Furthermore, the results
demonstrate that LinkFormer outperforms existing methodologies by a significant
margin, achieving a 48% improvement in F1-measure within a project-based
setting. Finally, the performance of LinkFormer in the cross-project setting is
comparable to its average performance within the project-based scenario.
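The two splitting policies the abstract contrasts can be sketched as follows. This is a minimal illustration, not the authors' code; the record structure and field names are hypothetical. A random split may place older artifacts in the test set and newer ones in training, leaking future information, whereas a temporal split guarantees that everything the model trains on predates everything it is tested on.

```python
import random

# Hypothetical issue-commit link records, each with a creation timestamp.
timestamps = [3, 1, 4, 1.5, 9, 2.6, 5, 3.5, 8, 7]
records = [{"id": i, "timestamp": ts} for i, ts in enumerate(timestamps)]

def random_split(data, test_ratio=0.3, seed=0):
    """Random split: test items may predate training items, leaking future info."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(data, test_ratio=0.3):
    """Temporal split: training data strictly precedes test data in time."""
    ordered = sorted(data, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * (1 - test_ratio))
    return ordered[:cut], ordered[cut:]

train, test = temporal_split(records)
# Every training record predates every test record.
assert max(r["timestamp"] for r in train) <= min(r["timestamp"] for r in test)
```

No such ordering guarantee holds for `random_split`, which is exactly the leakage the paper measures.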
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL).
It exploits the dependencies among a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and a real-world application to risk-gene discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z)
- Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z)
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is handling non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling this data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
- Building Resilience to Out-of-Distribution Visual Data via Input Optimization and Model Finetuning [13.804184845195296]
We propose a preprocessing model that learns to optimise input data for a specific target vision model.
We investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles.
We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model.
arXiv Detail & Related papers (2022-11-29T14:06:35Z)
- Investigating Enhancements to Contrastive Predictive Coding for Human Activity Recognition [7.086647707011785]
Contrastive Predictive Coding (CPC) is a technique that learns effective representations by leveraging properties of time-series data.
In this work, we propose enhancements to CPC by systematically investigating the architecture, the aggregator network, and future-timestep prediction.
Our method shows substantial improvements on four of six target datasets, demonstrating its ability to empower a wide range of application scenarios.
arXiv Detail & Related papers (2022-11-11T12:54:58Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis, including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
- Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study [0.0]
This paper is the first systematic study whose main objective is to assess the true effectiveness of embedding models.
Our experimental results show these models are much less accurate than previously perceived.
arXiv Detail & Related papers (2020-03-18T01:18:09Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample in order to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)