Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies
- URL: http://arxiv.org/abs/2506.06247v1
- Date: Fri, 06 Jun 2025 17:15:59 GMT
- Title: Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies
- Authors: Sedick David Baker Effendi, Xavier Pinho, Andrei Michael Dreyer, Fabian Yamaguchi,
- Abstract summary: This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.
- Score: 0.42855555838080833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines. This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.
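The over-approximation idea in the abstract can be illustrated with a small sketch. This is not Joern's API or the authors' implementation: the class `DependenceGraph`, the function `reachable`, the annotation format, and the example names (`lib.format`, `user_input`, `sql_query`) are all hypothetical. The sketch assumes that a call to an unannotated external procedure is recorded with an edge from every argument to its return value, and that annotations are applied only as a filter at query time, so the stored graph never has to be rebuilt.

```python
from collections import defaultdict, deque


class DependenceGraph:
    """Toy data-dependence graph (hypothetical, for illustration only)."""

    def __init__(self):
        # node -> list of (successor, call_name, source_slot, target_slot)
        self.edges = defaultdict(list)

    def add_flow(self, src, dst):
        """Record an ordinary def-use edge; annotations never filter it."""
        self.edges[src].append((dst, None, None, None))

    def add_external_call(self, name, args, ret):
        """Over-approximate: assume every argument may flow to the return value."""
        for index, arg in enumerate(args):
            self.edges[arg].append((ret, name, index, "ret"))


def reachable(graph, source, sink, annotations=None):
    """Breadth-first taint query.

    `annotations` maps a procedure name to the set of (source_slot,
    target_slot) flows it really has. Edges of annotated procedures that
    are not in that set are skipped; the graph itself is left unchanged.
    """
    annotations = annotations or {}
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for dst, call, src_slot, dst_slot in graph.edges.get(node, ()):
            if call in annotations and (src_slot, dst_slot) not in annotations[call]:
                continue  # the annotation rules this flow out
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return False


graph = DependenceGraph()
graph.add_flow("user_input", "x")
graph.add_external_call("lib.format", ["template", "x"], "out")
graph.add_flow("out", "sql_query")

# Without annotations the query reports the (possibly spurious) flow.
flow_without_annotations = reachable(graph, "user_input", "sql_query")

# A later annotation says only argument 0 reaches the return value.
later_annotations = {"lib.format": {(0, "ret")}}
flow_with_annotations = reachable(graph, "user_input", "sql_query", later_annotations)
```

In this sketch the same `graph` object answers both queries; only the annotation dictionary passed to `reachable` differs, which is the property the abstract describes as adding annotations later without recalculation.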
Related papers
- Scaling Inter-procedural Dataflow Analysis on the Cloud [19.562864760293955]
We develop a distributed framework called BigDataflow running on a large-scale cluster.
BigDataflow can finish analyzing a program with millions of lines of code in minutes.
arXiv Detail & Related papers (2024-12-17T06:18:56Z)
- Research on the Application of Spark Streaming Real-Time Data Analysis System and large language model Intelligent Agents [1.4582633500696451]
This study explores the integration of Agent AI with LangGraph to enhance real-time data analysis systems in big data environments.
The proposed framework overcomes limitations of static, inefficient stateful computations, and lack of human intervention.
The system architecture incorporates Apache Spark Streaming, Kafka, and LangGraph to create a high-performance sentiment analysis system.
arXiv Detail & Related papers (2024-12-10T05:51:11Z)
- Analyzing Logs of Large-Scale Software Systems using Time Curves Visualization [0.0]
We show that our approach can explain the main events in logs collected from different applications without prior knowledge.
As a result, we expect a significant reduction of the time required to identify performance bottlenecks and security risks.
arXiv Detail & Related papers (2024-11-08T12:42:45Z)
- GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models [58.08177466768262]
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.
We introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously.
Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin.
arXiv Detail & Related papers (2024-06-20T17:57:51Z)
- DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the probability distribution and independencies of the training set's features as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z)
- LLMDFA: Analyzing Dataflow in Code with Large Language Models [8.92611389987991]
This paper presents LLMDFA, a compilation-free and customizable dataflow analysis framework.
We decompose the problem into several subtasks and introduce a series of novel strategies.
On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35.
arXiv Detail & Related papers (2024-02-16T15:21:35Z)
- A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction [4.572330678291241]
We develop a unified active learning framework specializing in software performance prediction.
We investigate the impact of using different levels of information for active and passive learning.
Our approach aims to improve the investment in AI models for different software performance predictions.
arXiv Detail & Related papers (2023-04-06T14:00:48Z)
- Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made steady progress on SGG and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a play-and-plug fashion and expanded to large SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z)
- Iterative Rule Extension for Logic Analysis of Data: an MILP-based heuristic to derive interpretable binary classification from large datasets [0.6526824510982799]
This work presents IRELAND, an algorithm that allows for abstracting Boolean phrases in DNF from data with up to 10,000 samples and sample characteristics.
The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes.
arXiv Detail & Related papers (2021-10-25T13:31:30Z)
- Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation [52.9168275057997]
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs.
We show that Enel is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
arXiv Detail & Related papers (2021-08-27T10:21:08Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, as a major troubleshooting source of system information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
- Neural Language Modeling for Contextualized Temporal Graph Generation [49.21890450444187]
This paper presents the first study on using large-scale pre-trained language models for automated generation of an event-level temporal graph for a document.
arXiv Detail & Related papers (2020-10-20T07:08:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.