Understanding the Process of Data Labeling in Cybersecurity
- URL: http://arxiv.org/abs/2311.16388v1
- Date: Tue, 28 Nov 2023 00:20:07 GMT
- Title: Understanding the Process of Data Labeling in Cybersecurity
- Authors: Tobias Braun, Irdin Pekaric, Giovanni Apruzzese
- Abstract summary: In cyberthreat detection, high-quality data is hard to come by.
For some specific applications of Machine Learning, such data must be labeled by human operators.
We build a bridge between academic research and security practice in the context of data labeling.
- Score: 4.611436679049889
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many domains now leverage the benefits of Machine Learning (ML), which promises solutions that can autonomously learn to solve complex tasks by training over some data. Unfortunately, in cyberthreat detection, high-quality data is hard to come by. Moreover, for some specific applications of ML, such data must be labeled by human operators. Many works "assume" that labeling is tough/challenging/costly in cyberthreat detection, thereby proposing solutions to address such a hurdle. Yet, we found no work that specifically addresses the process of labeling 'from the viewpoint of ML security practitioners'. This is a problem: to date, it is still mostly unknown how labeling is done in practice -- thereby preventing one from pinpointing "what is needed" in the real world. In this paper, we take the first step to build a bridge between academic research and security practice in the context of data labeling. First, we reach out to five subject matter experts and carry out open interviews to identify pain points in their labeling routines. Then, by using our findings as a scaffold, we conduct a user study with 13 practitioners from large security companies, and ask detailed questions on subjects such as active learning, costs of labeling, and revision of labels. Finally, we perform proof-of-concept experiments addressing labeling-related aspects in cyberthreat detection that are sometimes overlooked in research. Altogether, our contributions and recommendations serve as a stepping stone to future endeavors aimed at improving the quality and robustness of ML-driven security systems. We release our resources.
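The abstract names active learning and labeling costs as central themes of the user study. For readers unfamiliar with that workflow, here is a minimal pool-based uncertainty-sampling loop; the synthetic dataset, model choice, and query budget are illustrative assumptions, not artifacts from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a threat-detection dataset (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))  # seed labels
pool = [i for i in range(len(X)) if i not in set(labeled)]
budget = 100  # assumed labeling budget (queries to a human analyst)

model = LogisticRegression(max_iter=1000)
for _ in range(budget):
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query the pool point closest to the boundary.
    probs = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)  # in practice: ask a human operator for y[query]
    pool.remove(query)

print("accuracy with", len(labeled), "labels:", model.score(X, y))
```

Each loop iteration corresponds to one query to a human analyst, which is precisely the cost the interviewed practitioners must budget for.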
Related papers
- Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving [86.04158840879727]
We develop a prompt-guided interaction procedure to get a powerful LLM to assign sensible skill labels to math questions.
We then have it perform semantic clustering to obtain coarser families of skill labels.
These coarse skill labels look interpretable to humans.
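As a rough sketch of what "semantic clustering of skill labels" can look like in code (the skill strings are hypothetical, and TF-IDF stands in for whatever semantic embedding the paper's pipeline actually uses):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical fine-grained skill labels an LLM might assign to math questions.
skills = [
    "solving linear equations", "factoring quadratics",
    "computing derivatives", "applying the chain rule",
    "triangle angle sums", "circle area and circumference",
]

# Embed the labels and cluster them into coarser skill families.
vecs = TfidfVectorizer().fit_transform(skills)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vecs)
for skill, c in zip(skills, clusters):
    print(c, skill)
```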
arXiv Detail & Related papers (2024-05-20T17:45:26Z)
- KeNet: Knowledge-enhanced Doc-Label Attention Network for Multi-label Text Classification [12.383260095788042]
Multi-Label Text Classification (MLTC) is a fundamental task in the field of Natural Language Processing (NLP).
We design an Attention Network that incorporates external knowledge, label embedding, and a comprehensive attention mechanism.
Our approach has been validated by comprehensive research conducted on three multi-label datasets.
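A minimal sketch of label-wise document attention, the core idea behind doc-label attention networks; it omits KeNet's knowledge-enhancement component, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DocLabelAttention(nn.Module):
    """Each label embedding attends over token representations to build
    a per-label document vector, then scores that label."""
    def __init__(self, hidden: int, n_labels: int):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, hidden)  # learned label embeddings
        self.out = nn.Linear(hidden, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden)
        scores = tokens @ self.label_emb.weight.T     # (batch, seq, labels)
        attn = torch.softmax(scores, dim=1)           # attention over tokens
        per_label = attn.transpose(1, 2) @ tokens     # (batch, labels, hidden)
        return self.out(per_label).squeeze(-1)        # (batch, labels) logits

logits = DocLabelAttention(hidden=64, n_labels=5)(torch.randn(2, 30, 64))
print(logits.shape)  # torch.Size([2, 5])
```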
arXiv Detail & Related papers (2024-03-04T06:52:19Z)
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
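One common way to realize "optimistic rewards" is an ensemble-based upper-confidence bonus; the sketch below assumes an ensemble of reward models and a bonus coefficient k, which may differ from the paper's exact construction:

```python
import numpy as np

def optimistic_rewards(ensemble_preds: np.ndarray, k: float = 1.0) -> np.ndarray:
    """Label unlabeled prior transitions with optimistic rewards:
    ensemble mean plus k standard deviations (a UCB-style bonus)."""
    return ensemble_preds.mean(axis=0) + k * ensemble_preds.std(axis=0)

# Toy example: 5 reward models scoring 4 prior transitions.
preds = np.random.default_rng(0).normal(size=(5, 4))
print(optimistic_rewards(preds, k=1.0))
```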
arXiv Detail & Related papers (2023-11-09T00:05:17Z)
- A Survey of Label-Efficient Deep Learning for 3D Point Clouds [109.07889215814589]
This paper presents the first comprehensive survey of label-efficient learning of point clouds.
We propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels.
For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges.
arXiv Detail & Related papers (2023-05-31T12:54:51Z)
- Supporting the Task-driven Skill Identification in Open Source Project Issue Tracking Systems [0.0]
We investigate an automatic labeling strategy for open issues to help contributors pick a task to work on.
By identifying the required skills, we argue that contributor candidates can pick tasks better suited to them.
We conducted quantitative studies to analyze the relevance of the labels in an experiment and to compare the strategies' relative importance.
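A toy version of such automatic issue labeling can be set up as multi-label text classification; the issue titles and skill labels below are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical issue titles and the skills needed to resolve them.
issues = [
    "Fix null pointer crash in parser", "Update README installation steps",
    "Add CSS styling to settings page", "Refactor parser error handling",
    "Document the plugin API", "Fix broken layout on mobile",
]
skills = [["c++"], ["docs"], ["frontend"], ["c++"], ["docs"], ["frontend"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(skills)  # binary indicator matrix, one column per skill
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(issues, Y)
pred = clf.predict(["Parser segfaults on empty input"])
print(mlb.inverse_transform(pred))
```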
arXiv Detail & Related papers (2022-11-02T14:17:22Z)
- A Survey on Extreme Multi-label Learning [72.8751573611815]
Multi-label learning has attracted significant attention from both academia and industry in recent decades.
However, it is infeasible to directly adapt conventional multi-label methods to extremely large label spaces because of the compute and memory overhead.
eXtreme Multi-label Learning (XML) has therefore become an important task, and many effective approaches have been proposed.
arXiv Detail & Related papers (2022-10-08T08:31:34Z)
- "Garbage In, Garbage Out" Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data? [0.0]
Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data.
This study builds on prior work that investigated to what extent 'best practices' around labeling training data were followed in applied ML publications.
arXiv Detail & Related papers (2021-07-05T21:24:02Z)
- Active Learning for Noisy Data Streams Using Weak and Strong Labelers [3.9370369973510746]
We consider a novel weak and strong labeler problem inspired by humans' natural ability to label.
We propose an online active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection.
We derive a decision function that measures the information gain by combining the informativeness of individual samples and model confidence.
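A toy version of such a decision function might weight per-sample entropy by the model's overall lack of confidence; both the ingredients and the product form below are assumptions for illustration, not the paper's exact formula:

```python
import numpy as np

def entropy(p: np.ndarray) -> np.ndarray:
    """Predictive entropy of each sample's class distribution."""
    return -np.sum(p * np.log(p + 1e-12), axis=1)

def information_gain(p: np.ndarray, model_conf: float) -> np.ndarray:
    """Toy decision score: a sample is worth labeling when it is
    individually uncertain (high entropy) and the model overall is
    still unconfident."""
    return (1.0 - model_conf) * entropy(p)

probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]])
scores = information_gain(probs, model_conf=0.7)
print(scores, "-> query sample", int(np.argmax(scores)))
```

In a weak/strong-labeler setting, a threshold on this score could route cheap samples to the weak labeler and the most valuable ones to the strong (human) labeler.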
arXiv Detail & Related papers (2020-10-27T09:18:35Z)
- Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise [21.491392581672198]
We present Snoopy, which aims to support data scientists and machine learning engineers in performing a systematic and theoretically founded feasibility study.
We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER).
We demonstrate in end-to-end experiments how users are able to save substantial labeling time and money.
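Snoopy uses its own BER estimators, but a classic cheap proxy is the 1-NN error rate, which asymptotically brackets the BER (Cover & Hart); a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# By the classic Cover & Hart bound, the Bayes error rate (BER) in the
# binary case asymptotically satisfies roughly:
#   R_1NN / 2 <= BER <= R_1NN
# so the 1-NN error gives a quick feasibility signal before labeling more.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y injects label noise
err_1nn = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=1),
                              X, y, cv=5).mean()
print(f"1-NN error: {err_1nn:.3f} -> BER roughly in "
      f"[{err_1nn / 2:.3f}, {err_1nn:.3f}]")
```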
arXiv Detail & Related papers (2020-10-16T14:21:19Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps with adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
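A minimal self-training sketch with confidence-weighted pseudo-labels (plain classification rather than sequence labeling, and a fixed confidence threshold standing in for the paper's meta-learned re-weighting):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Few-shot setup: 50 labeled points, the rest unlabeled.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_l, y_l, X_u = X[:50], y[:50], X[50:]

model = LogisticRegression(max_iter=1000).fit(X_l, y_l)
for _ in range(3):  # a few self-training rounds
    probs = model.predict_proba(X_u)
    conf = probs.max(axis=1)
    keep = conf > 0.9  # confidence threshold (assumed)
    X_aug = np.vstack([X_l, X_u[keep]])
    y_aug = np.concatenate([y_l, probs[keep].argmax(axis=1)])
    # Down-weight pseudo-labels by confidence to soften error propagation.
    w = np.concatenate([np.ones(len(y_l)), conf[keep]])
    model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug, sample_weight=w)

print("labeled:", len(y_l), "pseudo-labeled used:", int(keep.sum()))
```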
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
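A toy sketch of adversarial feature alignment in the same spirit: a discriminator tries to separate source features from target features while the encoder learns to fool it, pulling the two distributions together. All dimensions, losses, and the training schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
disc = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_e = torch.optim.Adam(enc.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

# Random stand-ins for unlabeled source and labeled target batches.
src, tgt = torch.randn(64, 32), torch.randn(64, 32) + 0.5
for _ in range(100):
    # 1) Train the discriminator: source -> 0, target -> 1.
    d_loss = (bce(disc(enc(src).detach()), torch.zeros(64, 1))
              + bce(disc(enc(tgt).detach()), torch.ones(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Train the encoder to make source features look like target ones.
    e_loss = bce(disc(enc(src)), torch.ones(64, 1))
    opt_e.zero_grad(); e_loss.backward(); opt_e.step()
```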
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.