VarCLR: Variable Semantic Representation Pre-training via Contrastive
Learning
- URL: http://arxiv.org/abs/2112.02650v1
- Date: Sun, 5 Dec 2021 18:40:32 GMT
- Title: VarCLR: Variable Semantic Representation Pre-training via Contrastive
Learning
- Authors: Qibin Chen, Jeremy Lacomis, Edward J. Schwartz, Graham Neubig, Bogdan
Vasilescu, Claire Le Goues
- Abstract summary: VarCLR is a new approach for learning semantic representations of variable names.
VarCLR is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs.
We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT.
- Score: 84.70916463298109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variable names are critical for conveying intended program behavior. Machine
learning-based program analysis methods use variable name representations for a
wide range of tasks, such as suggesting new variable names and bug detection.
Ideally, such methods could capture semantic relationships between names beyond
syntactic similarity, e.g., the fact that the names average and mean are
similar. Unfortunately, previous work has found that even the best of previous
representation approaches primarily capture relatedness (whether two variables
are linked at all), rather than similarity (whether they actually have the same
meaning).
We propose VarCLR, a new approach for learning semantic representations of
variable names that effectively captures variable similarity in this stricter
sense. We observe that this problem is an excellent fit for contrastive
learning, which aims to minimize the distance between explicitly similar
inputs, while maximizing the distance between dissimilar inputs. This requires
labeled training data, and thus we construct a novel, weakly-supervised
variable renaming dataset mined from GitHub edits. We show that VarCLR enables
the effective application of sophisticated, general-purpose language models
like BERT to variable name representation, and thus also to related downstream
tasks like variable name similarity search or spelling correction. VarCLR
produces models that significantly outperform the state-of-the-art on IdBench,
an existing benchmark that explicitly captures variable similarity (as distinct
from relatedness). Finally, we contribute a release of all data, code, and
pre-trained models, aiming to provide a drop-in replacement for variable
representations used in either existing or future program analyses that rely on
variable names.
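As a rough illustration of the idea, the sketch below trains a toy encoder with a contrastive (InfoNCE-style) objective on a handful of invented rename pairs, then uses the resulting embeddings for a similarity query. The encoder, the example names, and the hyperparameters are placeholders for exposition; this is not the authors' released VarCLR code, which pairs a BERT-style encoder with rename pairs mined from GitHub edits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NameEncoder(nn.Module):
    """Toy character-level stand-in for a BERT-style variable-name encoder."""
    def __init__(self, vocab_size=128, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, names):
        # Represent each variable name as the mean of its character embeddings.
        vecs = [self.embed(torch.tensor([min(ord(c), 127) for c in n])).mean(0)
                for n in names]
        return F.normalize(self.proj(torch.stack(vecs)), dim=-1)

def info_nce(anchors, positives, temperature=0.07):
    # Contrastive objective with in-batch negatives: each rename pair is pulled
    # together; all other names in the batch are pushed apart.
    logits = anchors @ positives.t() / temperature
    targets = torch.arange(anchors.size(0))
    return F.cross_entropy(logits, targets)

# Hypothetical weakly supervised rename pairs (old name -> new name), standing
# in for pairs mined from GitHub edits.
pairs = [("avg", "mean"), ("cnt", "count"), ("idx", "index"), ("tmp", "buffer")]

encoder = NameEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

old_names, new_names = zip(*pairs)
optimizer.zero_grad()
loss = info_nce(encoder(list(old_names)), encoder(list(new_names)))
loss.backward()
optimizer.step()

# The learned embeddings support downstream uses such as similarity search.
query, candidates = "average", ["mean", "median", "counter"]
scores = (encoder([query]) @ encoder(candidates).t()).squeeze(0)
for name, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```

The in-batch negatives are what enforce the "maximize distance between dissimilar inputs" half of the objective: each mined rename pair is pulled together while every other name in the batch is pushed away, which is what lets similarity, rather than mere relatedness, dominate the learned space.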
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Data-driven path collective variables [0.0]
We propose a new method for the generation, optimization, and comparison of collective variables.
The resulting collective variable is one-dimensional, interpretable, and differentiable.
We demonstrate the validity of the method on two different applications.
arXiv Detail & Related papers (2023-12-21T14:07:47Z)
- Scalable variable selection for two-view learning tasks with projection operators [0.0]
We propose a novel variable selection method for two-view settings, or for vector-valued supervised learning problems.
Our framework can handle extremely large-scale selection tasks, where the number of data samples can run into the millions.
arXiv Detail & Related papers (2023-07-04T08:22:05Z)
- Scalable Neural Symbolic Regression using Control Variables [7.725394912527969]
We propose ScaleSR, a scalable symbolic regression model that leverages control variables to enhance both accuracy and scalability.
The proposed method involves a four-step process. First, we learn a data generator from observed data using deep neural networks (DNNs).
Experimental results demonstrate that the proposed ScaleSR significantly outperforms state-of-the-art baselines in discovering mathematical expressions with multiple variables.
arXiv Detail & Related papers (2023-06-07T18:30:25Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders [6.68999512375737]
The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications.
Their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart.
We show that passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones.
arXiv Detail & Related papers (2021-09-26T19:04:57Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word-frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature (a toy voting sketch appears after this list).
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- $\ell_0$-based Sparse Canonical Correlation Analysis [7.073210405344709]
Canonical Correlation Analysis (CCA) models are powerful for studying the associations between two sets of variables.
Despite their success, CCA models may break if the number of variables in either of the modalities exceeds the number of samples.
Here, we propose $\ell_0$-CCA, a method for learning correlated representations based on sparse subsets of two observed modalities.
arXiv Detail & Related papers (2020-10-12T11:44:15Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
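As a rough sketch of the voting mechanism described in the SChME entry above, the snippet below averages hypothetical per-feature change probabilities into an ensemble score and thresholds it. The words, scores, feature names, and threshold are all invented for illustration and are not taken from the paper.

```python
# Toy illustration of per-feature "votes" combined into an ensemble score,
# in the spirit of the SChME summary above. All numbers and words are made up.

# Hypothetical per-feature probabilities that each word underwent semantic
# change, e.g. from an embedding-shift signal and a word-frequency signal.
votes = {
    "plane":  {"embedding_shift": 0.82, "frequency_shift": 0.64},
    "stable": {"embedding_shift": 0.31, "frequency_shift": 0.22},
}

def ensemble_probability(feature_votes):
    """Average the per-feature votes into a single change probability."""
    return sum(feature_votes.values()) / len(feature_votes)

threshold = 0.5  # assumed decision threshold for the binary change label
for word, feature_votes in votes.items():
    p = ensemble_probability(feature_votes)
    label = "changed" if p > threshold else "stable"
    print(f"{word}: p(change)={p:.2f} -> {label}")
```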