CodeArt: Better Code Models by Attention Regularization When Symbols Are
Lacking
- URL: http://arxiv.org/abs/2402.11842v1
- Date: Mon, 19 Feb 2024 05:13:22 GMT
- Title: CodeArt: Better Code Models by Attention Regularization When Symbols Are
Lacking
- Authors: Zian Su, Xiangzhe Xu, Ziyang Huang, Zhuo Zhang, Yapeng Ye, Jianjun
Huang, Xiangyu Zhang
- Abstract summary: Transformer-based code models have impressive performance in many software engineering tasks.
However, their effectiveness degrades when symbols are missing or not informative.
We propose a new method to pre-train general code models when symbols are lacking.
- Score: 12.458135956476639
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based code models have impressive performance in many software
engineering tasks. However, their effectiveness degrades when symbols are
missing or not informative. The reason is that the model may not learn to pay
attention to the right correlations/contexts without the help of symbols. We
propose a new method to pre-train general code models when symbols are lacking.
We observe that in such cases, programs degenerate to something written in a
very primitive language. We hence propose to use program analysis to extract
contexts a priori (instead of relying on symbols and masked language modeling
as in vanilla models). We then leverage a novel attention masking method to
only allow the model to attend to these contexts, e.g., bi-directional program
dependence transitive closures and token co-occurrences. Meanwhile, the
inherent self-attention mechanism is used to learn which of the allowed
attentions are more important than others. To realize the idea, we
enhance the vanilla tokenization and model architecture of a BERT model,
construct and utilize attention masks, and introduce a new pre-training
algorithm. We pre-train this BERT-like model from scratch, using a dataset of
26 million stripped binary functions with explicit program dependence
information extracted by our tool. We apply the model in three downstream
tasks: binary similarity, type inference, and malware family classification.
Our pre-trained model can improve the state of the art in these tasks from 53%
to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially
outperforms other general pre-training techniques for code understanding models.
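
As a rough illustration of the attention-masking idea described above, the sketch below builds an allowed-attention mask from hypothetical program-dependence edges (bidirectional and closed transitively) plus a local token co-occurrence window, and applies it as a mask inside standard scaled dot-product attention. The edge list, window size, and helper names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dependence_attention_mask(num_tokens, dep_edges, window=2):
    """Allowed-attention mask: transitive closure of bidirectional dependence
    edges, unioned with a local co-occurrence window. `dep_edges` stands in
    for the output of a (hypothetical) program-dependence analysis."""
    dep = torch.eye(num_tokens, dtype=torch.bool)
    for i, j in dep_edges:                        # bidirectional dependence edges
        dep[i, j] = dep[j, i] = True
    while True:                                   # transitive closure (fixpoint)
        new = dep | ((dep.float() @ dep.float()) > 0)
        if torch.equal(new, dep):
            break
        dep = new
    idx = torch.arange(num_tokens)
    cooc = (idx[:, None] - idx[None, :]).abs() <= window   # token co-occurrence
    return dep | cooc

def masked_self_attention(q, k, v, allowed):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))   # block disallowed contexts
    # regular self-attention still learns which of the allowed contexts matter most
    return F.softmax(scores, dim=-1) @ v

# toy usage: 8 tokens, two hypothetical dependence edges
n, d = 8, 16
mask = dependence_attention_mask(n, dep_edges=[(0, 5), (2, 7)])
q = k = v = torch.randn(1, n, d)
print(masked_self_attention(q, k, v, mask).shape)  # torch.Size([1, 8, 16])
```

The key design point is that the mask only constrains which attentions are permitted; the relative importance of the permitted links is still learned by the self-attention weights themselves.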
Related papers
- Model Stealing for Any Low-Rank Language Model [25.16701867917684]
We build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting.
Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution.
This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
arXiv Detail & Related papers (2024-11-12T04:25:31Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
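
For intuition, here is a generic mask-and-remask parallel decoding loop of the kind summarized above (a common GMLM decoding scheme, not the paper's actual T5 adaptation); `score_fn` is a stand-in for a trained masked language model.

```python
import torch

def parallel_decode(score_fn, length, mask_id, steps=4):
    """score_fn(tokens) -> (length, vocab) logits from a masked LM (assumed)."""
    tokens = torch.full((length,), mask_id)              # start fully masked
    for step in range(steps):
        probs = torch.softmax(score_fn(tokens), dim=-1)
        confidence, tokens = probs.max(dim=-1)           # predict every position in parallel
        num_remask = int(length * (1 - (step + 1) / steps))
        if num_remask == 0:
            break
        # re-mask the least confident positions and refine them in the next pass
        tokens[confidence.argsort()[:num_remask]] = mask_id
    return tokens

# toy usage: a random scorer stands in for a trained model
vocab, mask_id = 100, 99
print(parallel_decode(lambda t: torch.randn(t.size(0), vocab), length=12, mask_id=mask_id))
```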
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Beyond Self-learned Attention: Mitigating Attention Bias in
Transformer-based Models Using Attention Guidance [9.486558126032639]
We introduce SyntaGuid, a novel approach to guide Transformer-based models towards critical source code tokens.
We show that SyntaGuid can improve overall performance by up to 3.25% and fix up to 28.3% of wrong predictions.
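
One plausible way to "guide" attention toward critical source code tokens is an auxiliary loss on the attention weights; the sketch below is an assumption-laden illustration of that general idea, not the SyntaGuid objective itself.

```python
import torch

def attention_guidance_loss(attn, important):
    """attn: (heads, seq, seq) attention weights; important: (seq,) bool mask
    marking tokens deemed critical (e.g., by a syntax analysis, assumed here)."""
    mass_on_important = attn[..., important].sum(-1)   # per head, per query token
    return (1.0 - mass_on_important).mean()            # push attention mass toward them

# toy usage with random attention weights
attn = torch.softmax(torch.randn(8, 10, 10), dim=-1)
important = torch.zeros(10, dtype=torch.bool)
important[[2, 5]] = True
print(attention_guidance_loss(attn, important))
```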
arXiv Detail & Related papers (2024-02-26T18:03:50Z) - StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention [2.66269503676104]
We introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures.
This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning.
Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas.
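
As a loose illustration (the exact StochCA formulation is not given in this summary), the sketch below stochastically swaps a layer's self-attention keys/values for those produced by a frozen pretrained encoder, which is one way to "selectively utilize" pretrained knowledge through cross-attention during fine-tuning.

```python
import torch
import torch.nn.functional as F

def stochastic_cross_attention(q, k, v, k_pre, v_pre, p_cross=0.5, training=True):
    """With probability p_cross, queries attend to a frozen pretrained model's
    keys/values (k_pre, v_pre) instead of the fine-tuned model's own (k, v)."""
    use_cross = training and (torch.rand(()).item() < p_cross)
    keys, values = (k_pre, v_pre) if use_cross else (k, v)
    scores = q @ keys.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ values

# toy usage: pretrained k/v would come from a frozen copy of the backbone
q = k = v = torch.randn(1, 10, 32)
k_pre, v_pre = torch.randn(1, 10, 32), torch.randn(1, 10, 32)
print(stochastic_cross_attention(q, k, v, k_pre, v_pre).shape)
```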
arXiv Detail & Related papers (2024-02-25T13:53:49Z) - Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural aspect of programming languages.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
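
A common way to phrase "contrastive learning with hard negatives" is an InfoNCE-style loss over an anchor, a positive, and explicitly mined negatives; the sketch below shows that generic formulation, which may differ from the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negatives, temperature=0.05):
    """anchor, positive: (batch, dim); hard_negatives: (batch, n_neg, dim)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negatives, dim=-1)
    pos_sim = (a * p).sum(-1, keepdim=True)                 # (batch, 1)
    neg_sim = torch.einsum("bd,bkd->bk", a, n)              # (batch, n_neg)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

# toy usage with random embeddings standing in for code representations
b, k, d = 4, 3, 64
print(contrastive_loss(torch.randn(b, d), torch.randn(b, d), torch.randn(b, k, d)))
```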
arXiv Detail & Related papers (2024-02-02T22:19:15Z) - Who's Harry Potter? Approximate Unlearning in LLMs [4.821438899378393]
Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content.
This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers.
We propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch.
arXiv Detail & Related papers (2023-10-03T17:48:14Z) - Robust Attack Graph Generation [11.419463747286716]
We present a method to learn automaton models that are more robust to input modifications.
It iteratively aligns sequences to a learned model, modifies the sequences to their aligned versions, and re-learns the model.
arXiv Detail & Related papers (2022-06-15T19:26:39Z) - What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of which utterances or tokens are dull, without any feature engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.