Learning Type Inference for Enhanced Dataflow Analysis
- URL: http://arxiv.org/abs/2310.00673v2
- Date: Wed, 4 Oct 2023 15:15:00 GMT
- Title: Learning Type Inference for Enhanced Dataflow Analysis
- Authors: Lukas Seidel, Sedick David Baker Effendi, Xavier Pinho, Konrad Rieck,
Brink van der Merwe, Fabian Yamaguchi
- Abstract summary: We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations.
Our model outperforms the current state-of-the-art by 7.85% on the ManyTypes4TypeScript benchmark.
We present JoernTI, an integration of our approach into Joern, an open source static analysis tool.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statically analyzing dynamically-typed code is a challenging endeavor, as
even seemingly trivial tasks such as determining the targets of procedure calls
are non-trivial without knowing the types of objects at compile time.
Addressing this challenge, gradual typing is increasingly being added to
dynamically-typed languages, a prominent example being TypeScript, which
introduces static typing to JavaScript. Gradual typing improves the developer's
ability to verify program behavior, contributing to robust, secure and
debuggable programs. In practice, however, users only sparsely annotate types
directly. At the same time, conventional type inference faces
performance-related challenges as program size grows. Statistical techniques
based on machine learning offer faster inference, but although recent
approaches demonstrate overall improved accuracy, they still perform
significantly worse on user-defined types than on the most common built-in
types. Limiting their real-world usefulness even more, they rarely integrate
with user-facing applications. We propose CodeTIDAL5, a Transformer-based model
trained to reliably predict type annotations. For effective result retrieval
and re-integration, we extract usage slices from a program's code property
graph. Comparing our approach against recent neural type inference systems, our
model outperforms the current state-of-the-art by 7.85% on the
ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore,
we present JoernTI, an integration of our approach into Joern, an open source
static analysis tool, and demonstrate that the analysis benefits from the
additional type information. Since our model achieves fast inference even on
commodity CPUs, exposing it through Joern makes the system highly accessible
and facilitates security research.
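The abstract's "usage slices" are extracted from Joern's code property graph by following dataflow edges. As a rough illustration only, a toy line-based stand-in (the names `usage_slice` and `snippet` are ours, not the paper's) might look like:

```python
import re

def usage_slice(source: str, identifier: str) -> list[str]:
    """Collect the lines of `source` that mention `identifier`.

    Toy stand-in for CPG-based usage slicing: a real slice follows
    dataflow edges in the code property graph, not raw text matches.
    """
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")
    return [line.strip() for line in source.splitlines() if pattern.search(line)]

snippet = """
const resp = fetch(url);
const data = parse(resp);
log(data.items.length);
"""

# The slice for `data` covers its definition and every use site;
# these use sites are what the Transformer sees when predicting its type.
print(usage_slice(snippet, "data"))
```

In the actual system, slicing over the code property graph also captures uses reached only through dataflow, which a textual scan like this would miss.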
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
(arXiv, 2024-11-12)
- Inferring Pluggable Types with Machine Learning
This paper investigates how to use machine learning to infer type qualifiers automatically.
We propose a novel representation, NaP-AST, that encodes minimal dataflow hints for the effective inference of type qualifiers.
(arXiv, 2024-06-21)
- Generative Input: Towards Next-Generation Input Methods Paradigm
We propose a novel Generative Input paradigm named GeneInput.
It uses prompts to handle all input scenarios and other intelligent auxiliary input functions, optimizing the model with user feedback to deliver personalized results.
The results demonstrate that we have achieved state-of-the-art performance for the first time in the Full-mode Key-sequence to Characters (FK2C) task.
(arXiv, 2023-11-02)
- Generative Type Inference for Python
This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis.
TypeGen creates chain-of-thought (CoT) prompts by translating the type inference steps of static analysis into prompts based on type dependency graphs (TDGs).
Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match.
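The top-1 Exact Match metric used in these comparisons can be sketched as follows (a minimal illustration; `top1_exact_match` is our own helper, not TypeGen's code):

```python
def top1_exact_match(predictions: list[list[str]], gold: list[str]) -> float:
    """Top-1 Exact Match: fraction of annotation slots where the
    highest-ranked predicted type string equals the ground truth exactly."""
    hits = sum(
        1
        for ranked, truth in zip(predictions, gold)
        if ranked and ranked[0] == truth
    )
    return hits / len(gold)

# Three slots: the top-ranked type is correct for two of them.
preds = [["string", "any"], ["number"], ["Foo[]", "Foo"]]
truth = ["string", "boolean", "Foo[]"]
print(top1_exact_match(preds, truth))  # 2/3 of top-ranked types match
```

Exact string match is strict: a prediction of `Foo` for a `Foo[]` slot counts as wrong, which is why user-defined and parameterized types drag down overall accuracy.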
(arXiv, 2023-07-18)
- TypeT5: Seq2seq Type Inference using Static Analysis
We present a new type inference method that treats type prediction as a code infilling task.
Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model.
We also propose an iterative decoding scheme that incorporates previous type predictions in the model's input context.
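Such an iterative decoding loop can be sketched as below, with a toy `predict` function standing in for the seq2seq model (all names here are illustrative, not TypeT5's API):

```python
def iterative_decode(slots, predict, rounds=2):
    """Sketch of iterative decoding: each round re-predicts every slot
    with the current predictions for the other slots visible in the
    context, so later predictions can refine earlier ones."""
    context = {s: "unknown" for s in slots}
    for _ in range(rounds):
        for s in slots:
            context[s] = predict(s, dict(context))
    return context

# Toy model: `x` is always a number; `y` copies whatever type
# the context currently assigns to `x`.
def toy_predict(slot, context):
    if slot == "x":
        return "number"
    return context["x"]

print(iterative_decode(["x", "y"], toy_predict))
# {'x': 'number', 'y': 'number'} — y picks up x's refined type
```

The point of the scheme is exactly this propagation: once one slot's type firms up, dependent slots are re-predicted against the improved context instead of against "unknown".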
(arXiv, 2023-03-16)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
(arXiv, 2021-11-04)
- Comparative Code Structure Analysis using Deep Learning for Performance Prediction
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source code to discover latent representations and achieve up to 84% (individual problem) and 73% (combined dataset with multiple problems) accuracy in predicting the change in performance.
(arXiv, 2021-02-12)
- Advanced Graph-Based Deep Learning for Probabilistic Type Inference
We introduce a range of graph neural network (GNN) models that operate on a novel type flow graph (TFG) representation.
Our GNN models are trained to predict the type labels in the TFG for a given input program.
We show that our best two GNN configurations for accuracy achieve a top-1 accuracy of 87.76% and 86.89% respectively.
(arXiv, 2020-09-13)
- LambdaNet: Probabilistic Type Inference using Graph Neural Networks
This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network.
Our approach can predict both standard types, like number or string, and user-defined types that have not been encountered during training.
(arXiv, 2020-04-29)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.