An Empirical Study on Data Leakage and Generalizability of Link
Prediction Models for Issues and Commits
- URL: http://arxiv.org/abs/2211.00381v2
- Date: Mon, 24 Apr 2023 11:01:55 GMT
- Authors: Maliheh Izadi, Pooya Rostami Mazrae, Tom Mens, Arie van Deursen
- Abstract summary: LinkFormer preserves and improves the accuracy of existing predictions.
Our findings support that to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data.
- Score: 7.061740334417124
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To enhance documentation and maintenance practices, developers conventionally
establish links between related software artifacts manually. Empirical research
has revealed that developers frequently overlook this practice, resulting in
significant information loss. To address this issue, automatic link recovery
techniques have been proposed. However, these approaches have primarily
focused on improving prediction accuracy on randomly split datasets, with
limited attention given to the impact of data leakage and the
generalizability of the
predictive models. LinkFormer seeks to address these limitations. Our approach
not only preserves and improves the accuracy of existing predictions but also
enhances their alignment with real-world settings and their generalizability.
First, to better utilize contextual information for prediction, we employ the
Transformer architecture and fine-tune multiple pre-trained models on both
textual and metadata information of issues and commits. Next, to gauge the
effect of time on model performance, we employ two splitting policies during
both the training and testing phases: randomly and temporally split datasets.
Finally, in pursuit of a generic model that can demonstrate high performance
across a range of projects, we undertake additional fine-tuning of LinkFormer
within two distinct transfer-learning settings. Our findings confirm that, to
simulate real-world scenarios effectively, researchers must maintain the
temporal flow of data when training models. Furthermore, the results
demonstrate that LinkFormer outperforms existing methodologies by a significant
margin, achieving a 48% improvement in F1-measure within a project-based
setting. Finally, the performance of LinkFormer in the cross-project setting is
comparable to its average performance within the project-based scenario.
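The two splitting policies the abstract contrasts can be sketched as follows. This is a minimal illustration, not the authors' code; the record structure and field names are hypothetical. A random split may place older artifacts in the test set and newer ones in training, leaking future information, whereas a temporal split guarantees that everything the model trains on predates everything it is tested on.

```python
import random

# Hypothetical issue-commit link records, each with a creation timestamp.
timestamps = [3, 1, 4, 1.5, 9, 2.6, 5, 3.5, 8, 7]
records = [{"id": i, "timestamp": ts} for i, ts in enumerate(timestamps)]

def random_split(data, test_ratio=0.3, seed=0):
    """Random split: test items may predate training items, leaking future info."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(data, test_ratio=0.3):
    """Temporal split: training data strictly precedes test data in time."""
    ordered = sorted(data, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * (1 - test_ratio))
    return ordered[:cut], ordered[cut:]

train, test = temporal_split(records)
# Every training record predates every test record.
assert max(r["timestamp"] for r in train) <= min(r["timestamp"] for r in test)
```

No such ordering guarantee holds for `random_split`, which is exactly the leakage the paper measures.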
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL).
It exploits the dependencies among a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and a real-world application to risk-gene discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z)
- Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z)
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is handling non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling this data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
- Building Resilience to Out-of-Distribution Visual Data via Input Optimization and Model Finetuning [13.804184845195296]
We propose a preprocessing model that learns to optimise input data for a specific target vision model.
We investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles.
We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model.
arXiv Detail & Related papers (2022-11-29T14:06:35Z)
- Investigating Enhancements to Contrastive Predictive Coding for Human Activity Recognition [7.086647707011785]
Contrastive Predictive Coding (CPC) is a technique that learns effective representations by leveraging properties of time-series data.
In this work, we propose enhancements to CPC by systematically investigating the architecture, the aggregator network, and future-timestep prediction.
Our method shows substantial improvements on four of six target datasets, demonstrating its ability to empower a wide range of application scenarios.
arXiv Detail & Related papers (2022-11-11T12:54:58Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis, including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
- Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study [0.0]
This paper is the first systematic study whose main objective is to assess the true effectiveness of embedding models.
Our experimental results show these models are much less accurate than previously perceived.
arXiv Detail & Related papers (2020-03-18T01:18:09Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample in order to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)