"vcd2df" -- Leveraging Data Science Insights for Hardware Security Research
- URL: http://arxiv.org/abs/2505.06470v3
- Date: Tue, 10 Jun 2025 02:20:30 GMT
- Title: "vcd2df" -- Leveraging Data Science Insights for Hardware Security Research
- Authors: Calvin Deutschbein, Jimmy Ostler, Hriday Raj
- Abstract summary: We create a bridge from hardware design languages (HDLs) to data science languages like Python and R. We show how insights can be derived in high-level languages from register transfer level (RTL) trace data.
- Score: 0.6554326244334868
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we hope to expand the universe of security practitioners of open-source hardware by creating a bridge from hardware design languages (HDLs) to data science languages like Python and R through novel libraries that convert VCD (value change dump) files into data frames, the expected input type of modern data science tools. We show how insights can be derived in high-level languages from register transfer level (RTL) trace data. Additionally, we show a promising future direction in hardware security research leveraging the parallelism of Spark to study transient execution CPU vulnerabilities, and provide reproducibility for researchers via GitHub and Colab.
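The core idea -- reading a VCD trace into a data frame indexed by simulation time, with one column per signal -- can be sketched in plain Python with pandas. This is a hand-rolled approximation for illustration only; the actual vcd2df library's API and parsing behavior may differ:

```python
# Illustrative sketch: parse a tiny value change dump (VCD) into a pandas
# DataFrame whose rows are timestamps and whose columns are signal names.
# Not the vcd2df library itself -- a minimal stand-in for the concept.
import pandas as pd

VCD = """$timescale 1ns $end
$var wire 1 ! clk $end
$var wire 8 " data $end
$enddefinitions $end
#0
0!
b00000000 "
#5
1!
b00001010 "
#10
0!
"""

def vcd_to_df(text):
    ids, rows, current, time = {}, {}, {}, None
    for line in text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "$var":            # $var wire <width> <id> <name> $end
            ids[tok[3]] = tok[4]
        elif tok[0].startswith("#"):    # new timestamp: snapshot current values
            time = int(tok[0][1:])
            rows[time] = dict(current)
        elif tok[0].startswith("b"):    # vector change: b<value> <id>
            name = ids[tok[1]]
            current[name] = tok[0][1:]
            if time is not None:
                rows[time][name] = tok[0][1:]
        elif tok[0][0] in "01xz":       # scalar change: <value><id>
            name = ids[tok[0][1:]]
            current[name] = tok[0][0]
            if time is not None:
                rows[time][name] = tok[0][0]
    return pd.DataFrame.from_dict(rows, orient="index").ffill()

df = vcd_to_df(VCD)
print(df)
```

Once the trace lives in a data frame, ordinary data-science operations (filtering, grouping, joins, or distributing rows across Spark partitions) apply directly to RTL signal data, which is the bridge the paper describes.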
Related papers
- VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation [1.0798445660490976]
Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management. The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest high-quality Verilog dataset for fine-tuning to our knowledge.
arXiv Detail & Related papers (2025-07-09T17:06:54Z)
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
- DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis [14.341633834445307]
DeepCircuitX is a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels.
arXiv Detail & Related papers (2025-02-25T15:34:00Z)
- Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis [14.458529723566379]
Large language models (LLMs) can be employed for programming languages such as Python and C++. This paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design.
arXiv Detail & Related papers (2025-02-19T17:53:59Z)
- SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Large Language Models (LLMs) are trained on extensive datasets that include code repositories. Evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
- Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report [3.4632900249241874]
This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source.
The RAG architecture combines the generative capabilities of Large Language Models (LLMs) with the precision of information retrieval.
The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors.
arXiv Detail & Related papers (2024-10-21T12:21:49Z)
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to an object detector with simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
- Verilog-to-PyG -- A Framework for Graph Learning and Augmentation on RTL Designs [15.67829950106923]
We introduce an innovative open-source framework that translates RTL designs into graph representation foundations.
The Verilog-to-PyG (V2PYG) framework is compatible with the open-source Electronic Design Automation (EDA) toolchain OpenROAD.
We present novel RTL data augmentation methods that enable functionally equivalent design augmentation for the construction of an extensive graph-based RTL design database.
arXiv Detail & Related papers (2023-11-09T20:11:40Z)
- Towards the Imagenets of ML4EDA [24.696892205786742]
We describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis.
The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks.
The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis.
arXiv Detail & Related papers (2023-10-16T16:35:03Z)
- Linking the Dynamic PicoProbe Analytical Electron-Optical Beam Line / Microscope to Supercomputers [39.52789559084336]
Dynamic PicoProbe at Argonne National Laboratory is undergoing upgrades that will enable it to produce up to 100s of GB of data per day.
While this data is highly important for both fundamental science and industrial applications, there is currently limited on-site infrastructure to handle these high-volume data streams.
We address this problem by providing a software architecture capable of supporting large-scale data transfers to neighboring supercomputers at the Argonne Leadership Computing Facility.
This infrastructure supports expected workloads and also provides domain scientists the ability to reinterrogate data from past experiments to yield additional scientific value and derive new insights.
arXiv Detail & Related papers (2023-08-25T23:07:58Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- PyRelationAL: a python library for active learning research and development [1.0061110876649197]
Active learning (AL) is a sub-field of ML focused on the development of methods to iteratively and economically acquire data.
Here, we introduce PyRelationAL, an open source library for AL research.
We describe a modular toolkit based around a two step design methodology for composing pool-based active learning strategies.
arXiv Detail & Related papers (2022-05-23T08:21:21Z)
- Torchhd: An Open Source Python Library to Support Research on Hyperdimensional Computing and Vector Symbolic Architectures [99.70485761868193]
We present Torchhd, a high-performance open source Python library for HD/VSA.
Torchhd seeks to make HD/VSA more accessible and serves as an efficient foundation for further research and application development.
arXiv Detail & Related papers (2022-05-18T20:34:25Z)
- Data Engineering for HPC with Python [0.0]
Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements.
One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications.
We present a distributed Python API based on table abstraction for representing and processing data.
arXiv Detail & Related papers (2020-10-13T11:53:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.