Related papers: Integrating APK Image and Text Data for Enhanced Threat Detection: A Multimodal Deep Learning Approach to Android Malware

Integrating APK Image and Text Data for Enhanced Threat Detection: A Multimodal Deep Learning Approach to Android Malware

URL: http://arxiv.org/abs/2601.08959v1
Date: Tue, 13 Jan 2026 19:59:18 GMT
Title: Integrating APK Image and Text Data for Enhanced Threat Detection: A Multimodal Deep Learning Approach to Android Malware
Authors: Md Mashrur Arifin, Maqsudur Rahman, Nasir U. Eisty,
Abstract summary: This paper proposes a multimodal deep learning framework integrating APK images and textual features to enhance Android malware detection.<n>We evaluate various image types and resolutions across different Convolutional Neural Networks (CNN) architectures, including VGG, ResNet-152, MobileNet, DenseNet, EfficientNet-B4.<n>The findings demonstrate that RGB images at higher resolutions (e.g., 256x256, 512x512) achieve superior classification performance, while the multimodal integration of image and text using the CLIP model reveals limited potential.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As zero-day Android malware attacks grow more sophisticated, recent research highlights the effectiveness of using image-based representations of malware bytecode to detect previously unseen threats. However, existing studies often overlook how image type and resolution affect detection and ignore valuable textual data in Android Application Packages (APKs), such as permissions and metadata, limiting their ability to fully capture malicious behavior. The integration of multimodality, which combines image and text data, has gained momentum as a promising approach to address these limitations. This paper proposes a multimodal deep learning framework integrating APK images and textual features to enhance Android malware detection. We systematically evaluate various image types and resolutions across different Convolutional Neural Networks (CNN) architectures, including VGG, ResNet-152, MobileNet, DenseNet, EfficientNet-B4, and use LLaMA-2, a large language model, to extract and annotate textual features for improved analysis. The findings demonstrate that RGB images at higher resolutions (e.g., 256x256, 512x512) achieve superior classification performance, while the multimodal integration of image and text using the CLIP model reveals limited potential. Overall, this research highlights the importance of systematically evaluating image attributes and integrating multimodal data to develop effective malware detection for Android systems.

Related papers

More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored.<n>We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z)
DMLDroid: Deep Multimodal Fusion Framework for Android Malware Detection with Resilience to Code Obfuscation and Adversarial Perturbations [1.3792857294744785]
We propose DMLDroid, an Android malware detection based on multimodal fusion.<n>We conduct exhaustive experiments independently on each feature, as well as in combination, using different fusion strategies.<n>Our findings highlight the benefits of multimodal fusion in improving both detection accuracy and robustness against evolving Android malware threats.
arXiv Detail & Related papers (2025-09-14T09:32:27Z)
Generating Multimodal Images with GAN: Integrating Text, Image, and Style [7.481665175881685]
We propose a multimodal image generation method based on Generative Adversarial Networks (GAN)<n>This method involves the design of a text encoder, an image feature extractor, and a style integration module.<n> Experimental results show that our method produces images with high clarity and consistency across multiple public datasets.
arXiv Detail & Related papers (2025-01-04T02:51:28Z)
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model [48.547599530927926]
Synthetic images, when shared on social media, can mislead extensive audiences and erode trust in digital content.<n>We introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages.<n>We propose a new image deepfake detection, localization, and explanation framework, named SIDA.
arXiv Detail & Related papers (2024-12-05T16:12:25Z)
Android Malware Detection Based on RGB Images and Multi-feature Fusion [3.1244204900991623]
This paper proposes an end-to-end Android malware detection technique based on RGB images and multi-feature fusion. Experiments demonstrate that the proposed method effectively captures Android malware characteristics, achieving an accuracy of up to 97.25%.
arXiv Detail & Related papers (2024-08-29T14:18:54Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark. FineMatch focuses on text and image mismatch detection and correction. We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts. Deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)
Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper presents a holistic goal of maintaining spatially-precise high-resolution representations through the entire network. We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode [7.556301451015502]
We develop and assess a baseline pipeline for image-based malware detection with straightforward steps. We propose DexRay, which converts the bytecode of the app DEX files into grey-scale "vector" images and feeds them to a 1-dimensional Convolutional Neural Network model. The performance of DexRay evaluated on over 158k apps demonstrates that, while simple, our approach is effective with a high detection rate.
arXiv Detail & Related papers (2021-09-05T16:33:05Z)
Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network. Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.