A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep
Learning for Source Code
- URL: http://arxiv.org/abs/2010.12663v2
- Date: Tue, 27 Apr 2021 15:28:30 GMT
- Title: A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep
Learning for Source Code
- Authors: Nadezhda Chirkova, Sergey Troshin
- Abstract summary: We propose a method, based on identifier anonymization, to handle out-of-vocabulary (OOV) identifiers.
Our method can be treated as a preprocessing step and, therefore, allows for easy implementation.
We show that the proposed OOV anonymization method significantly improves the performance of the Transformer in two code processing tasks.
- Score: 14.904366372190943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an emerging interest in the application of natural language
processing models to source code processing tasks. One of the major problems in
applying deep learning to software engineering is that source code often
contains a lot of rare identifiers, resulting in huge vocabularies. We propose
a simple, yet effective method, based on identifier anonymization, to handle
out-of-vocabulary (OOV) identifiers. Our method can be treated as a
preprocessing step and, therefore, allows for easy implementation. We show that
the proposed OOV anonymization method significantly improves the performance of
the Transformer in two code processing tasks: code completion and bug fixing.
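To make the preprocessing idea concrete, below is a minimal sketch of identifier anonymization for Python code, assuming a simple tokenizer-based pipeline; the placeholder scheme (VAR_1, VAR_2, ...) and the fixed vocabulary are illustrative assumptions, not the authors' exact configuration.

```python
import io
import keyword
import tokenize

def anonymize_identifiers(source: str, vocab: set) -> str:
    """Replace out-of-vocabulary identifiers with consistent placeholder
    tokens within a single snippet. A minimal sketch of the anonymization
    idea; the placeholder format and vocabulary are assumptions, not the
    paper's exact preprocessing."""
    mapping = {}
    out_tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(text)
                and text not in vocab):
            # The same OOV identifier maps to the same placeholder,
            # so data flow within the snippet is preserved.
            mapping.setdefault(text, f"VAR_{len(mapping) + 1}")
            text = mapping[text]
        out_tokens.append((tok.type, text))
    return tokenize.untokenize(out_tokens)

# 'my_rare_counter' is OOV and is renamed consistently; 'step' is kept.
print(anonymize_identifiers("my_rare_counter = 0\nmy_rare_counter += step\n",
                            vocab={"step"}))
```

The intuition is that the model rarely needs the rare name itself, only a consistent symbol for it within the example, which keeps the vocabulary small.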
Related papers
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks.
However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution.
We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
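As a rough illustration of the divide-and-conquer idea (our reading of the abstract, not the paper's actual LLM prompting pipeline): decompose the word into individual characters, apply the edit at the character level, then recombine.

```python
def char_edit(word: str, op: str, target: str, repl: str = "") -> str:
    """Divide-and-conquer character manipulation: split the word into
    single characters, edit per character, then recombine. A sketch of
    the idea as we read it from the abstract; the paper realizes it via
    LLM prompting, not plain string code."""
    chars = list(word)                                     # divide
    if op == "delete":
        chars = [c for c in chars if c != target]          # per-character edit
    elif op == "substitute":
        chars = [repl if c == target else c for c in chars]
    return "".join(chars)                                  # combine

assert char_edit("banana", "delete", "a") == "bnn"
assert char_edit("banana", "substitute", "a", "o") == "bonono"
```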
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning [11.337238450492546]
We propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning.
NACS strips information bound to variable names from the Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures.
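A minimal sketch of what stripping name information from an AST can look like, using Python's ast module; NACS's actual contrastive multi-view setup is more involved, and the placeholder scheme here is an assumption.

```python
import ast

class NameStripper(ast.NodeTransformer):
    """Map every variable name in the AST to a positional placeholder,
    so only the syntactic structure remains. A naming-agnostic
    preprocessing sketch, not NACS's actual implementation."""
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        # Identical names share a placeholder; tree structure is untouched.
        self.names.setdefault(node.id, f"v{len(self.names)}")
        node.id = self.names[node.id]
        return node

tree = ast.parse("total = price * quantity\nprint(total)")
print(ast.unparse(NameStripper().visit(tree)))
# v0 = v1 * v2
# v3(v0)   ('print' is itself a Name node, so it is anonymized too)
```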
arXiv Detail & Related papers (2024-08-18T03:47:34Z)
- Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines [0.0]
We introduce a dynamic pipeline that transforms natural language task descriptions into code through high-level data-shaping instructions.
This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code.
We propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction.
arXiv Detail & Related papers (2024-03-18T08:58:47Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, leading to better task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- Exploring Representation-Level Augmentation for Code Search [50.94201167562845]
We explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
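For intuition, representation-level augmentation operates directly on encoder outputs, so no extra preprocessing or training passes are needed. The two operators below (linear interpolation and random masking) are generic examples of this family; the paper's concrete operator set may differ.

```python
import numpy as np

def interpolate(r1, r2, lam=0.8):
    """Mix two encoded representations (e.g., of two code snippets).
    A generic representation-level operator, assumed for illustration."""
    return lam * r1 + (1.0 - lam) * r2

def random_mask(r, p=0.1, seed=0):
    """Dropout-style masking of representation dimensions."""
    rng = np.random.default_rng(seed)
    return r * (rng.random(r.shape) >= p)

rng = np.random.default_rng(42)
code_vec_a = rng.standard_normal(8)   # stand-in encoder outputs
code_vec_b = rng.standard_normal(8)
augmented = random_mask(interpolate(code_vec_a, code_vec_b))
print(augmented)
```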
arXiv Detail & Related papers (2022-10-21T22:47:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Exploiting Method Names to Improve Code Summarization: A Deliberation Multi-Task Learning Approach [5.577102440028882]
We design a novel multi-task learning (MTL) approach for code summarization.
We first introduce the tasks of method name generation and informativeness prediction.
A novel two-pass deliberation mechanism is then incorporated into our MTL architecture to generate more consistent intermediate states.
arXiv Detail & Related papers (2021-03-21T17:52:21Z)
- Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
- DeepSumm -- Deep Code Summaries using Neural Transformer Architecture [8.566457170664927]
We employ neural techniques for the task of source code summarization.
With more than 2.1 million supervised samples of comments and code, we reduce training time by more than 50%.
arXiv Detail & Related papers (2020-03-31T22:43:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.