Does Configuration Encoding Matter in Learning Software Performance? An
Empirical Study on Encoding Schemes
- URL: http://arxiv.org/abs/2203.15988v2
- Date: Fri, 1 Apr 2022 13:32:53 GMT
- Authors: Jingzhi Gong, Tao Chen
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning and predicting the performance of a configurable software system
helps to provide better quality assurance. One important engineering decision
therein is how to encode the configuration into the model built. Despite the
presence of different encoding schemes, there is still little understanding of
which is better and under what circumstances, as the community often relies on
some general beliefs that inform the decision in an ad-hoc manner. To bridge
this gap, in this paper, we empirically compared the widely used encoding
schemes for software performance learning, namely label, scaled label, and
one-hot encoding. The study covers five systems, seven models, and three
encoding schemes, leading to 105 cases of investigation.
Our key findings reveal that: (1) conducting trial-and-error to find the best
encoding scheme in a case-by-case manner can be rather expensive, requiring up
to 400+ hours on some models and systems; (2) one-hot encoding often leads to
the most accurate results, while scaled label encoding is generally weak on
accuracy across different models; (3) conversely, scaled label encoding tends
to result in the fastest training time across the models/systems, while
one-hot encoding is the slowest; (4) for all models studied, label and scaled
label encoding often lead to a more balanced trade-off between accuracy and
training time, but the best-paired model varies according to the system.
We discuss the actionable suggestions derived from our findings, hoping to
provide a better understanding of this topic for the community. To promote open
science, the data and code of this work can be publicly accessed at
https://github.com/ideas-labo/MSR2022-encoding-study.
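To make the three schemes concrete, below is a minimal sketch of how a single categorical configuration option might be encoded under each of them. The option name and its values are hypothetical illustrations, not drawn from the systems in the study, and this is not the authors' implementation:

```python
# Sketch of the three encoding schemes compared in the paper:
# label, scaled label, and one-hot encoding of a categorical option.

def label_encode(values, vocab):
    """Map each categorical value to an integer index."""
    index = {v: i for i, v in enumerate(vocab)}
    return [index[v] for v in values]

def scaled_label_encode(values, vocab):
    """Label-encode, then min-max scale the indices into [0, 1]."""
    labels = label_encode(values, vocab)
    hi = len(vocab) - 1
    return [lab / hi if hi > 0 else 0.0 for lab in labels]

def one_hot_encode(values, vocab):
    """Expand each value into a binary indicator vector."""
    index = {v: i for i, v in enumerate(vocab)}
    return [[1 if index[v] == i else 0 for i in range(len(vocab))]
            for v in values]

# A hypothetical "cache_policy" option with three settings.
vocab = ["lru", "lfu", "fifo"]
configs = ["lfu", "lru", "fifo"]

print(label_encode(configs, vocab))         # [1, 0, 2]
print(scaled_label_encode(configs, vocab))  # [0.5, 0.0, 1.0]
print(one_hot_encode(configs, vocab))       # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

The trade-off the paper measures follows directly from the shapes: one-hot widens the feature space (one column per setting, which can help accuracy but slows training), while label and scaled label keep a single column per option.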
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z)
- An Exploration of Encoder-Decoder Approaches to Multi-Label Classification for Legal and Biomedical Text [20.100081284294973]
We compare four methods for multi-label classification, two based on an encoder only, and two based on an encoder-decoder.
Our results show that encoder-decoder methods outperform encoder-only methods, with a growing advantage on more complex datasets.
arXiv Detail & Related papers (2023-05-09T17:13:53Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- Learning to Improve Code Efficiency [27.768476489523163]
We analyze a large competitive programming dataset from the Google Code Jam competition.
We find that efficient code is indeed rare, with a 2x difference between the median runtime and the 90th percentile of solutions.
We propose using machine learning to automatically provide prescriptive feedback in the form of hints, to guide programmers towards writing high-performance code.
arXiv Detail & Related papers (2022-08-09T01:28:30Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on the task of document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- Rate Coding or Direct Coding: Which One is Better for Accurate, Robust, and Energy-efficient Spiking Neural Networks? [4.872468969809081]
Work on Spiking Neural Networks (SNNs) typically focuses on image classification tasks; therefore, various coding techniques have been proposed to convert an image into temporal binary spikes.
Among them, rate coding and direct coding are regarded as prospective candidates for building a practical SNN system.
We conduct a comprehensive analysis of the two codings from three perspectives: accuracy, adversarial robustness, and energy-efficiency.
arXiv Detail & Related papers (2022-01-31T16:18:07Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)