VuLASTE: Long Sequence Model with Abstract Syntax Tree Embedding for
Vulnerability Detection
- URL: http://arxiv.org/abs/2302.02345v1
- Date: Sun, 5 Feb 2023 09:17:02 GMT
- Title: VuLASTE: Long Sequence Model with Abstract Syntax Tree Embedding for
Vulnerability Detection
- Authors: Botong Zhu and Huobin Tan
- Abstract summary: We build a model named VuLASTE, which regards vulnerability detection as a special text classification task.
To solve the vocabulary explosion problem, VuLASTE uses a byte-level BPE algorithm from natural language processing.
To test our model performance on real-world source code, we build a cross-language and multi-repository vulnerability dataset.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we build a model named VuLASTE, which regards vulnerability
detection as a special text classification task. To solve the vocabulary
explosion problem, VuLASTE uses a byte-level BPE algorithm from natural
language processing. In VuLASTE, a new AST path embedding is added to represent
source code nesting information. We also use a combination of global and
dilated window attention from Longformer to extract long-sequence semantics from
source code. To solve the data imbalance problem, which is a common problem in
vulnerability detection datasets, focal loss is used as the loss function to make
the model focus on poorly classified cases during training. To test our model
performance on real-world source code, we build a cross-language and
multi-repository vulnerability dataset from the GitHub Security Advisory Database.
On this dataset, VuLASTE achieved top-50, top-100, top-200 and top-500 hit counts
of 29, 51, 86 and 228 respectively, higher than state-of-the-art approaches.
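The focal loss mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard binary focal loss, not VuLASTE's actual implementation; the alpha and gamma values are common defaults and are assumptions, as the abstract does not state them:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability.

    p: predicted probability of the positive (vulnerable) class.
    y: ground-truth label, 1 for vulnerable, 0 for non-vulnerable.
    alpha, gamma: standard focal-loss hyperparameters (assumed values).
    """
    # p_t is the model's probability for the true class.
    p_t = p if y == 1 else 1.0 - p
    # The (1 - p_t)^gamma factor down-weights easy, well-classified
    # examples, so gradient updates focus on hard, poorly classified
    # cases -- the behavior the abstract attributes to focal loss.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction (p_t = 0.9) contributes far less
# loss than a badly misclassified one (p_t = 0.1).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With gamma = 0 this reduces to alpha-weighted cross-entropy; increasing gamma suppresses the contribution of already well-classified examples, which is why it suits the class-imbalanced vulnerability datasets the abstract describes.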
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning [0.25602836891933073]
It is difficult to identify the disaster-related posts among the large amounts of unstructured data available.
Previous methods often use keyword filtering, topic modelling or classification-based techniques to identify such posts.
This study investigates the potential of Active Learning (AL) for identifying disaster-related Tweets.
arXiv Detail & Related papers (2024-08-19T11:40:20Z) - OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z) - WitheredLeaf: Finding Entity-Inconsistency Bugs with LLMs [22.22945885085009]
Entity-Inconsistency Bugs (EIBs) originate from semantic bugs.
EIBs are subtle and can remain undetected for years.
We introduce a novel, cascaded EIB detection system named WitheredLeaf.
arXiv Detail & Related papers (2024-05-02T18:44:34Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - MeaeQ: Mount Model Extraction Attacks with Efficient Queries [6.1106195466129485]
We study model extraction attacks in natural language processing (NLP).
We propose MeaeQ, a straightforward yet effective method to address these issues.
MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries.
arXiv Detail & Related papers (2023-10-21T16:07:16Z) - Leveraging Vision-Language Foundation Models for Fine-Grained Downstream
Tasks [17.367599062853156]
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets.
We propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models.
arXiv Detail & Related papers (2023-07-13T15:05:34Z) - DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System
for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z) - GLEN: General-Purpose Event Detection for Thousands of Types [80.99866527772512]
We build a general-purpose event detection dataset GLEN, which covers 205K event mentions with 3,465 different types.
GLEN is 20x larger in ontology than today's largest event dataset.
We also propose a new multi-stage event detection model CEDAR specifically designed to handle the large size in GLEN.
arXiv Detail & Related papers (2023-03-16T05:36:38Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - AstBERT: Enabling Language Model for Code Understanding with Abstract
Syntax Tree [3.1087379479634927]
We propose the AstBERT model, a pre-trained language model aiming to better understand programming languages (PL) using the abstract syntax tree (AST).
Specifically, we collect a large amount of source code (in both Java and Python) from GitHub, from which information in the source code can be interpreted and integrated.
Experiment results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks.
arXiv Detail & Related papers (2022-01-20T03:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.