Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised
Learning for Software Analytics
- URL: http://arxiv.org/abs/2302.01997v1
- Date: Fri, 3 Feb 2023 20:59:09 GMT
- Title: Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised
Learning for Software Analytics
- Authors: Huy Tu and Tim Menzies
- Abstract summary: Semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data.
This paper argues that such ``strong'' algorithms perform better than standard, weaker SSL algorithms.
- Score: 31.13621632964345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many domains, there are many examples and far fewer labels for those
examples; e.g. we may have access to millions of lines of source code, but
access to only a handful of warnings about that code. In those domains,
semi-supervised learners (SSL) can extrapolate labels from a small number of
examples to the rest of the data. Standard SSL algorithms use ``weak''
knowledge (i.e. knowledge not based on specific SE insights); e.g., they
co-train two learners and use good labels from one to train the other. Another
SSL approach in software analytics is to use ``strong'' knowledge that
exploits SE expertise. For example, an often-used heuristic in SE is that
unusually large artifacts contain undesired properties (e.g. more bugs). This
paper argues that such ``strong'' algorithms perform better than those
standard, weaker SSL algorithms. We show this by learning models from labels
generated using weak SSL or our ``stronger'' FRUGAL algorithm. In four domains
(distinguishing security-related bug reports; mitigating bias in
decision-making; predicting issue close time; and reducing false alarms in
static code warnings), FRUGAL required only 2.5% of the data to be labeled yet
out-performed standard semi-supervised learners that relied on (e.g.)
domain-independent graph theory concepts. Hence, for future work, we strongly
recommend the use of strong heuristics for semi-supervised learning for SE
applications. To better support other researchers, our scripts and data are
on-line at https://github.com/HuyTu7/FRUGAL.
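The abstract's central example of a ``strong'' heuristic is that unusually large artifacts tend to contain more bugs. As a rough illustration, such a heuristic can act as a pseudo-labeler for the unlabeled data. The sketch below is an assumption-laden toy, not the paper's FRUGAL implementation: the `loc` field name and the 90th-percentile cutoff are invented for illustration.

```python
# Hypothetical sketch of a "strong heuristic" pseudo-labeler (NOT the actual
# FRUGAL code): flag the unusually large artifacts as defect-prone, so that an
# off-the-shelf learner can then be trained on the resulting pseudo-labels.

def pseudo_label_by_size(artifacts, size_pctile=90):
    """Label artifacts at or above a size percentile as 'buggy', else 'clean'.

    `artifacts` is assumed to be a list of dicts with a 'loc' (lines of code)
    field; both the field name and the percentile cutoff are illustrative
    assumptions, not values taken from the paper.
    """
    sizes = sorted(a["loc"] for a in artifacts)
    # Clamp the index so a 100th-percentile request stays in range.
    idx = min(int(len(sizes) * size_pctile / 100), len(sizes) - 1)
    cutoff = sizes[idx]
    return [dict(a, label="buggy" if a["loc"] >= cutoff else "clean")
            for a in artifacts]
```

In this style of SSL, domain knowledge replaces the domain-independent machinery (e.g. co-training or similarity graphs) that weak SSL methods use to propagate labels.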
Related papers
- Active Self-Supervised Learning: A Few Low-Cost Relationships Are All
You Need [34.013568381942775]
Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data.
In this work, we formalize and generalize this principle through Positive Active Learning (PAL) where an oracle queries semantic relationships between samples.
First, it unveils a theoretically grounded learning framework beyond SSL, based on similarity graphs, that can be extended to tackle supervised and semi-supervised learning depending on the employed oracle.
Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL losses without any change in the training pipeline.
arXiv Detail & Related papers (2023-03-27T14:44:39Z) - A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends [82.64268080902742]
Self-supervised learning (SSL) aims to learn discriminative features from unlabeled data without relying on human-annotated labels.
SSL has garnered significant attention recently, leading to the development of numerous related algorithms.
This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions.
arXiv Detail & Related papers (2023-01-13T14:41:05Z) - OpenLDN: Learning to Discover Novel Classes for Open-World
Semi-Supervised Learning [110.40285771431687]
Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning.
Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data.
This work introduces OpenLDN that utilizes a pairwise similarity loss to discover novel classes.
arXiv Detail & Related papers (2022-07-05T18:51:05Z) - Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach, called Trustable Co-label Learning (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z) - Robust Deep Semi-Supervised Learning: A Brief Introduction [63.09703308309176]
Semi-supervised learning (SSL) aims to improve learning performance by leveraging unlabeled data when labels are insufficient.
SSL with deep models has proven to be successful on standard benchmark tasks.
However, they are still vulnerable to various robustness threats in real-world applications.
arXiv Detail & Related papers (2022-02-12T04:16:41Z) - Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance.
Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations.
We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z) - FRUGAL: Unlocking SSL for Software Analytics [17.63040340961143]
Unsupervised learning is a promising direction to learn hidden patterns within unlabelled data.
We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme.
arXiv Detail & Related papers (2021-08-22T21:15:27Z) - Analysis of label noise in graph-based semi-supervised learning [2.4366811507669124]
In machine learning, one must acquire labels to help supervise a model that will be able to generalize to unseen data.
It is often the case that most of our data is unlabeled.
Semi-supervised learning (SSL) alleviates that by making strong assumptions about the relation between the labels and the input data distribution.
arXiv Detail & Related papers (2020-09-27T22:13:20Z) - Self-supervised Learning on Graphs: Deep Insights and New Direction [66.78374374440467]
Self-supervised learning (SSL) aims to create domain specific pretext tasks on unlabeled data.
There is increasing interest in generalizing deep learning to the graph domain in the form of graph neural networks (GNNs).
arXiv Detail & Related papers (2020-06-17T20:30:04Z) - NeuCrowd: Neural Sampling Network for Representation Learning with
Crowdsourced Labels [19.345894148534335]
We propose NeuCrowd, a unified framework for supervised representation learning (SRL) from crowdsourced labels.
The proposed framework is evaluated on both one synthetic and three real-world data sets.
arXiv Detail & Related papers (2020-03-21T13:38:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.