Content-Based Textual File Type Detection at Scale
- URL: http://arxiv.org/abs/2101.08508v1
- Date: Thu, 21 Jan 2021 09:08:42 GMT
- Title: Content-Based Textual File Type Detection at Scale
- Authors: Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli
- Abstract summary: Programming language detection is a common need in the analysis of large source code bases.
We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Programming language detection is a common need in the analysis of large
source code bases. It is supported by a number of existing tools that rely on
several features, and most notably file extensions, to determine file types. We
consider the problem of accurately detecting the type of files commonly found
in software code bases, based solely on textual file content. Doing so is
helpful to classify source code that lacks file extensions (e.g., code snippets
posted on the Web or executable scripts), to avoid misclassifying source code
that has been recorded with wrong or uncommon file extensions, and also to shed
some light on the intrinsic recognizability of source code files. We propose a
simple model that (a) uses a language-agnostic word tokenizer for textual files,
(b) groups tokens in 1-/2-grams, (c) builds feature vectors based on N-gram
frequencies, and (d) uses a simple fully connected neural network as classifier.
As a training set, we use textual files extracted from GitHub repositories with at
least 1000 stars, using existing file extensions as ground truth. Despite its
simplicity, the proposed model reaches 85% in our experiments for a relatively
high number of recognized classes (more than 130 file types).
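Steps (a) to (c) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression used as a tokenizer, the vocabulary size cap `max_features`, and all function names are assumptions made for illustration only. The resulting vectors are the kind of input a fully connected classifier, as in step (d), would consume.

```python
import re
from collections import Counter

# Assumed tokenizer: a run of word characters, or any single
# non-space symbol. The paper's actual tokenizer may differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens without any language-specific rules."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, max_n=2):
    """Return all 1-grams up to max_n-grams as tuples of tokens."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def build_vocabulary(documents, max_features=1000):
    """Map the most frequent n-grams in the corpus to vector indices."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(tokenize(text)))
    most_common = counts.most_common(max_features)
    return {gram: idx for idx, (gram, _) in enumerate(most_common)}


def feature_vector(text, vocabulary):
    """Relative frequency of each vocabulary n-gram in the given text."""
    counts = Counter(g for g in ngrams(tokenize(text)) if g in vocabulary)
    total = sum(counts.values())
    vector = [0.0] * len(vocabulary)
    if total == 0:
        return vector  # no known n-gram occurs in this text
    for gram, count in counts.items():
        vector[vocabulary[gram]] = count / total
    return vector


if __name__ == "__main__":
    corpus = ["def f(): pass", "int main() { return 0; }"]
    vocab = build_vocabulary(corpus)
    print(len(vocab), feature_vector("def g(): pass", vocab)[:5])
```

Normalizing by the total count keeps vectors comparable across files of different lengths; raw counts or other weightings would be equally plausible choices here.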
Related papers
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Revisiting File Context for Source Code Summarization [2.85386288555414]
A typical use case is generating short summaries of subroutines for use in API documentation.
The heart of almost all current research into code summarization is the encoder-decoder neural architecture.
In this paper, we revisit the idea of "file context" for code summarization.
arXiv Detail & Related papers (2023-09-05T15:44:46Z)
- Adversarial Networks and Machine Learning for File Classification [0.0]
Correctly identifying the type of file under examination is a critical part of a forensic investigation.
We propose using an adversarially-trained machine learning neural network to determine a file's true type.
Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types.
arXiv Detail & Related papers (2023-01-27T19:40:03Z)
- CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [82.88371379927112]
We propose a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs.
CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and a 28.69% relative increase in identifier matching for code completion when the cross-file context is provided.
arXiv Detail & Related papers (2022-12-20T05:48:09Z)
- DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Toward the Detection of Polyglot Files [2.7402733069180996]
It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats.
The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis.
This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
arXiv Detail & Related papers (2022-03-14T23:48:22Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes [55.32790803903619]
We propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes.
Our method does not require any side-information, like annotated attributes or label meta-data.
We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes.
arXiv Detail & Related papers (2021-06-02T21:57:52Z)
- Short Text Classification Approach to Identify Child Sexual Exploitation Material [4.415977307120616]
This paper presents two approaches based on short text classification to identify Child Sexual Exploitation Material (CSEM) files.
The presented solution could be integrated into forensic tools and services to support Law Enforcement Agencies to identify CSEM without tackling every file's visual content.
arXiv Detail & Related papers (2020-10-29T09:37:16Z)