ATLAS: Automated Tree-based Language Analysis System for C and C++ source programs
- URL: http://arxiv.org/abs/2512.12507v1
- Date: Sun, 14 Dec 2025 01:11:11 GMT
- Title: ATLAS: Automated Tree-based Language Analysis System for C and C++ source programs
- Authors: Jaid Monwar Chowdhury, Ahmad Farhan Shahriar Chowdhury, Humayra Binte Monwar, Mahmuda Naznin,
- Abstract summary: This paper introduces ATLAS, a Python-based Command-Line Interface (CLI) that generates statement-level Control Flow Graphs (CFG) and type-aware Data Flow Graphs (DFG)<n>ATLAS provides a practical foundation for improving downstream Software Engineering (SE) and machine-learning-based program understanding.
- Score: 1.0499611180329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing complexity of modern software systems has highlighted the shortcomings of traditional programming analysis techniques, particularly for Software Engineering (SE) tasks. While machine learning and Large Language Models (LLMs) offer promising solutions, their effectiveness is limited by the way they interpret data. Unlike natural language, source code meaning is defined less by token adjacency and more by complex, long-range, and structural relationships and dependencies. This limitation is especially pronounced for C and C++, where flatter syntactic hierarchies, pointer aliasing, multi-level indirection, typedef-based type obfuscation, and function-pointer calls hinder accurate static analysis. To address these challenges, this paper introduces ATLAS, a Python-based Command-Line Interface (CLI) that (i) generates statement-level Control Flow Graphs (CFG) and type-aware Data Flow Graphs (DFG) that capture inter-functional dependencies for the entire program; (ii) has the ability to work on entire C and C++ projects comprising multiple files; (iii) works on both compilable and non-compilable code and (iv) produces a unified multi-view code representation using Abstract Syntax Trees (AST), CFG and DFG. By preserving essential structural and semantic information, ATLAS provides a practical foundation for improving downstream SE and machine-learning-based program understanding. Video demonstration: https://youtu.be/RACWQe5ELwY Tool repository: https://github.com/jaid-monwar/ATLAS-code-representation-tool
Related papers
- LogicLens: Leveraging Semantic Code Graph to explore Multi Repository large systems [0.2519906683279152]
We introduce LogicLens, a reactive conversational agent that assists developers in exploring complex software systems.<n>We present the architecture of the system, discuss emergent behaviors, and evaluate its effectiveness on real-world multi-repository scenarios.
arXiv Detail & Related papers (2026-01-15T15:35:23Z) - InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution [31.437457217953835]
We introduce INFCODE-C++, the first C++-aware autonomous system for end-to-end issue resolution.<n>The system combines two complementary retrieval mechanisms -- semantic code-intent retrieval and deterministic AST-structured querying.<n>It achieves a resolution rate of 25.58%, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE-agent.
arXiv Detail & Related papers (2025-11-20T03:05:26Z) - Data Dependency-Aware Code Generation from Enhanced UML Sequence Diagrams [54.528185120850274]
We propose a novel step-by-step code generation framework named API2Dep.<n>First, we introduce an enhanced Unified Modeling Language (UML) API diagram tailored for service-oriented architectures.<n>Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference task.
arXiv Detail & Related papers (2025-08-05T12:28:23Z) - WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding [58.1177179119881]
We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching.<n> wgrammar achieves up to 250x speedup over existing systems.
arXiv Detail & Related papers (2025-07-22T17:13:47Z) - Guided Tensor Lifting [54.10411390218929]
Domain-specific languages (s) for machine learning are revolutionizing the speed and efficiency of machine learning workloads.<n>To take advantage of these capabilities, a user must first translate their legacy code from the language it is currently written in, into the new DSL.<n>Process of automatically lifting code into these DSLs has been identified by several recent works, which propose program synthesis as a solution.
arXiv Detail & Related papers (2025-04-28T12:00:10Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge.<n>CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z) - LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z) - ProgSG: Cross-Modality Representation Learning for Programs in
Electronic Design Automation [38.023395256208055]
High-level synthesis (HLS) allows a developer to compile a high-level description in the form of software code in C and C++.
HLS tools still require microarchitecture decisions, expressed in terms of pragmas.
We propose ProgSG allowing the source code sequence modality and the graph modalities to interact with each other in a deep and fine-grained way.
arXiv Detail & Related papers (2023-05-18T09:44:18Z) - Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis.
When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.