Exploring Software Naturalness through Neural Language Models
- URL: http://arxiv.org/abs/2006.12641v2
- Date: Wed, 24 Jun 2020 13:55:50 GMT
- Title: Exploring Software Naturalness through Neural Language Models
- Authors: Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui
Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost,
Yufan Zhuang, Giacomo Domeniconi
- Abstract summary: The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
- Score: 56.1315223210742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Software Naturalness hypothesis argues that programming languages can be
understood through the same techniques used in natural language processing. We
explore this hypothesis through the use of a pre-trained transformer-based
language model to perform code analysis tasks. Present approaches to code
analysis depend heavily on features derived from the Abstract Syntax Tree (AST)
while our transformer-based language models work on raw source code. This work
is the first to investigate whether such language models can discover AST
features automatically. To achieve this, we introduce a sequence labeling task
that directly probes the language model's understanding of the AST. Our results show
that transformer-based language models achieve high accuracy in the AST tagging
task. Furthermore, we evaluate our model on a software vulnerability
identification task. Importantly, we show that our approach obtains
vulnerability identification results comparable to graph based approaches that
rely heavily on compilers for feature extraction.
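The AST tagging task described above can be pictured as supervised sequence labeling over raw source tokens. The sketch below constructs gold (token, label) pairs for a Python snippet using the standard `ast` and `tokenize` modules; Python stands in for the C code the paper analyzes, and the label set (AST node type names plus an "O" tag) is an illustrative assumption, not the paper's actual tag scheme.

```python
import ast
import io
import tokenize

def ast_tag_tokens(source: str):
    """Build gold (token, label) pairs for an AST tagging task.

    Each NAME token is labeled with the type of the AST node anchored at
    its source position; tokens with no anchoring node get "O". A model in
    the spirit of the paper would predict such labels from raw code alone.
    """
    tree = ast.parse(source)
    # Map (line, column) of each node of interest to its AST node type.
    anchors = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.arg, ast.Name)):
            anchors[(node.lineno, node.col_offset)] = type(node).__name__
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            pairs.append((tok.string, anchors.get(tok.start, "O")))
    return pairs
```

For `def add(a, b)` with body `return a + b`, the parameters come back tagged `arg` and their uses in the body tagged `Name`, mirroring the kind of syntactic signal the probing task asks the language model to recover.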
Related papers
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval-augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
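The retrieval step of such a RAG pipeline can be sketched with a toy lexical retriever over descriptive-grammar passages; real systems would use BM25 or dense retrieval, and the snippets and query below are hypothetical examples, not the paper's data.

```python
def retrieve_grammar_context(query: str, grammar_snippets: list, k: int = 2):
    """Rank descriptive-grammar snippets by word overlap with the query.

    The top-k snippets would be prepended to the LLM prompt so it can
    correct the smaller model's morphological glosses.
    """
    query_words = set(query.lower().split())
    scored = sorted(
        grammar_snippets,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]
```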
arXiv Detail & Related papers (2024-10-01T04:20:14Z)
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
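A synthetic CFG of this kind can be sampled with a few lines of code. The grammar below is a hypothetical toy stand-in for the paper's grammars, which are larger and deeper; the point is only that recursive rules yield hierarchically structured strings.

```python
import random

# Toy context-free grammar: uppercase symbols are nonterminals.
GRAMMAR = {
    "S": [["A", "B"], ["B", "A"]],
    "A": [["a"], ["a", "S"]],
    "B": [["b"], ["S", "b"]],
}

def sample(symbol="S", rng=None, depth=0, max_depth=6):
    """Recursively expand `symbol` into a list of terminal tokens."""
    rng = rng or random.Random(0)
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth cap, prefer rules with no nonterminals.
        terminal_rules = [r for r in rules if all(t not in GRAMMAR for t in r)]
        rules = terminal_rules or rules
    tokens = []
    for part in rng.choice(rules):
        tokens.extend(sample(part, rng, depth + 1, max_depth))
    return tokens
```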
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
arXiv Detail & Related papers (2022-10-26T04:47:18Z)
- Transformer-Based Language Models for Software Vulnerability Detection: Performance, Model's Security and Platforms [21.943263073426646]
We study how well large transformer-based language models detect software vulnerabilities.
We perform the model's security check using Microsoft's Counterfit, a command-line tool.
We present our recommendation while choosing the platforms to run these large models.
arXiv Detail & Related papers (2022-04-07T04:57:42Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements.
We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa).
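The cloze-based extraction step can be sketched as follows; `fill_fn` stands in for a masked language model's top prediction (e.g. a BERT fill-mask pipeline), and the statement templates are hypothetical.

```python
def cloze_to_triples(statements, fill_fn):
    """Turn cloze statements into knowledge-graph triples.

    Each statement is (subject, relation, template); the template contains
    a [MASK] slot whose filled-in value becomes the triple's object.
    """
    triples = []
    for subject, relation, template in statements:
        prompt = template.format(subject=subject)
        obj = fill_fn(prompt)  # the masked LM's top prediction
        triples.append((subject, relation, obj))
    return triples
```

Running this at successive training checkpoints and diffing the resulting graphs yields the knowledge acquisition timeline the paper describes.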
arXiv Detail & Related papers (2021-11-16T15:18:01Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
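The perturbation behind this finding is simple to reproduce: shuffle the words of a premise or hypothesis and re-run the NLI model. The helper below is a hypothetical sketch; the paper's exact permutation scheme may differ.

```python
import random

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Return the sentence with its words randomly reordered.

    An NLI model that genuinely relies on syntax should change its
    prediction on such inputs far more often than the paper observes.
    """
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)
```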
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Exploring Neural Models for Parsing Natural Language into First-Order Logic [10.62143644603835]
We study the capability of neural models in parsing English sentences into First-Order Logic (FOL).
We model FOL parsing as a sequence-to-sequence mapping task: a natural language sentence is encoded into an intermediate representation by an LSTM, and a decoder then sequentially generates the predicates of the corresponding FOL formula.
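The decoder's target side in such a setup is a linearization of the FOL formula. The sketch below flattens a nested formula into the token sequence a seq2seq decoder would be trained to emit; the nested-tuple encoding is an illustrative assumption, not the paper's representation.

```python
def linearize(formula):
    """Flatten a nested FOL formula (operator, *args) into a token list.

    Tuples are sub-formulas; strings are quantifiers, connectives,
    variables, or predicate names. A seq2seq decoder would be trained
    to emit these tokens in order.
    """
    if not isinstance(formula, tuple):
        return [formula]
    op, *args = formula
    tokens = [op]
    for arg in args:
        tokens.extend(linearize(arg))
    return tokens
```

For example, `("exists", "x", ("and", ("dog", "x"), ("barks", "x")))` linearizes to `exists x and dog x barks x`, the kind of flat sequence the LSTM decoder generates predicate by predicate.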
arXiv Detail & Related papers (2020-02-16T09:22:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.