DexBERT: Effective, Task-Agnostic and Fine-grained Representation
Learning of Android Bytecode
- URL: http://arxiv.org/abs/2212.05976v2
- Date: Thu, 24 Aug 2023 09:00:09 GMT
- Title: DexBERT: Effective, Task-Agnostic and Fine-grained Representation
Learning of Android Bytecode
- Authors: Tiezhu Sun (1), Kevin Allix (1), Kisub Kim (2), Xin Zhou (2), Dongsun
Kim (3), David Lo (2), Tegawend\'e F. Bissyand\'e (1) and Jacques Klein (1)
((1) University of Luxembourg, (2) Singapore Management University, (3)
Kyungpook National University)
- Abstract summary: We propose a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks.
- Score: 0.40571357119162643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automation of a large number of software engineering tasks is becoming
possible thanks to Machine Learning (ML). Central to applying ML to software
artifacts (like source or executable code) is converting them into forms
suitable for learning. Traditionally, researchers have relied on manually
selected features, based on expert knowledge which is sometimes imprecise and
generally incomplete. Representation learning has allowed ML to automatically
choose suitable representations and relevant features. Yet, for Android-related
tasks, existing models like apk2vec focus on whole-app levels, or target
specific tasks like smali2vec, which limits their applicability. Our work is
part of a new line of research that investigates effective, task-agnostic, and
fine-grained universal representations of bytecode to mitigate both of these
two limitations. Such representations aim to capture information relevant to
various low-level downstream tasks (e.g., at the class-level). We are inspired
by the field of Natural Language Processing, where the problem of universal
representation was addressed by building Universal Language Models, such as
BERT, whose goal is to capture abstract semantic information about sentences,
in a way that is reusable for a variety of tasks. We propose DexBERT, a
BERT-like Language Model dedicated to representing chunks of DEX bytecode, the
main binary format used in Android applications. We empirically assess whether
DexBERT is able to model the DEX language and evaluate the suitability of our
model in three distinct class-level software engineering tasks: Malicious Code
Localization, Defect Prediction, and Component Type Classification. We also
experiment with strategies to deal with the problem of catering to apps having
vastly different sizes, and we demonstrate one example of using our technique
to investigate what information is relevant to a given task.
Related papers
- Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-ization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z) - Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication.
We look at the application of building CS corpora.
We make the task more realistic by scaling it to more languages.
We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
arXiv Detail & Related papers (2024-02-02T15:38:47Z) - Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks by minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions
with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Toolformer: Language Models Can Teach Themselves to Use Tools [62.04867424598204]
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale.
We show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.
We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
arXiv Detail & Related papers (2023-02-09T16:49:57Z) - Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z) - Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z) - XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation [80.18830380517753]
We develop a new task-agnostic distillation framework XtremeDistilTransformers.
We study the transferability of several source tasks, augmentation resources and model architecture for distillation.
arXiv Detail & Related papers (2021-06-08T17:49:33Z) - Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.