Trends in current malware datasets


Dataset Composition


In this set of experiments, we used our encyclopedia to retrieve various pieces of information about the malicious apps in the Malgenome, Piggybacking, and AMD datasets. We used this information to infer the following insights about those datasets.

Malgenome

Malgenome was gathered by Zhou et al. between 2010 and 2012. For years, it has been considered the de facto Android malware dataset and has, consequently, been used by many researchers to evaluate their analysis/detection approaches; it has even been included in newer datasets such as Drebin and AMD. More details about the gathering process can be found in the Zhou+2012 paper. Unfortunately, the Malgenome project has been discontinued. However, its apps are included in the Drebin dataset, which is still accessible here.

Total apps found: 1234 (out of 1260)
Average number of VirusTotal scanners detecting an app: 31.43
Source marketplace(s): Various (Zhou+2012)
Top 10 families (as per VirusTotal): Droidkungfu (38%), BaseBridge (~25%), Geinimi (~5%), Kmin (~4%), Ddlight (3.8%), Golddream (3.7%), PJapps (3.6%), Lotoor (1.8%), yzhc (1.7%), adrd (1.7%)
Top 10 types (as per VirusTotal): Trojan (94.6%), Exploit (3.4%), Spyware (1.7%), Syp++Trojan (1 app), FakeEnflict++Trojan (1 app)
% of scanner consensus on family name: 0% (not a single app had consensus on the family name)
% of scanner consensus on type: 0% (not a single app had consensus on the type)

Piggybacking

The Piggybacking dataset was gathered by Li et al. roughly between 2014 and 2017. The dataset is part of the AndroZoo project and focuses on repackaged/piggybacked malware. It comprises a list of original apps along with their piggybacked versions. We managed to download 1355 original apps and 1399 of their repackaged versions (N.B. one benign app can have multiple fake versions). Li et al. reported the trends exhibited by the piggybacked apps in their paper Li+2017. The following figures were gathered by statically analyzing the piggybacked (malicious) apps only.

Total apps found: 1136 (out of 1399)
Average number of VirusTotal scanners detecting an app: 9
Source marketplace(s): Anzhi (~64%), AppChina (~12%), AnGeeks (5.1%), 1Mobile (5%), Google Play (4.4%), Malgenome (3.1%), Google Play + AppChina (8 apps), Anzhi + AppChina (5 apps), Slideme (4 apps), Mi.com + Anzhi (4 apps)
Top 10 families (as per VirusTotal): Dowgin (24.6%), Kuguo (~22%), Gingermaster (~6%), Adwhirlads (5.6%), Ginermaster (5.5%), Admobads (~4%), Adwo (~3%), Youmi (2.2%), Droidkungfu (2.1%), Geinimi (1.7%)
Top 10 types (as per VirusTotal): Adware (64%), Trojan (25%), Spyware (~2.3%), Riskware (2.2%), Adsware (2.2%), Troj (1.9%), Unclassified (6 apps), Adware++Adware (4 apps), TrojanSMS (2 apps), Spr (2 apps)
% of scanner consensus on family name: 35% of apps had consensus on the family name
% of scanner consensus on type: 45% of apps had consensus on the type

AMD

The most recent dataset we could find was gathered by Wei et al. in Wei+2017. The dataset comprises 24,000 malicious apps gathered from a multitude of marketplaces and older datasets. We randomly sampled the dataset to obtain 1,250 malicious apps that preserve the distribution of malware families in the original dataset, as reported by the authors on their website (a sketch of such a stratified sample is given after the statistics below). Unfortunately, we could only gather information about 204 of those apps from our Euphony-based encyclopedia.

Total apps found: 204 (out of 1250)
Average number of VirusTotal scanners detecting an app: 24.6
Source marketplace(s): Malgenome (29%), Google Play (27.4%), AppChina (23%), Anzhi (10.7%), AnGeeks (~5%), AppChina + Google Play (4 apps), FreewareLovers (1 app), AppChina + Malgenome (1 app)
Top 10 families (as per VirusTotal): Droidkungfu (20.5%), Airpush (8.8%), Ginmaster (8.3%), Kyview (6.8%), Dowgin (6.3%), GoldDream (5.8%), Nandrobox (5.4%), Lotoor (5.4%), Youmi (8 apps), Kuguo (8 apps)
Top 10 types (as per VirusTotal): Trojan (42.6%), Adware (34.4%), Exploit (11.2%), Riskware (3.9%), Monitor (3.4%), Spyware (1.9%), Hacktool (2 apps), Fakeupdates++Trojan (1 app), Addisplay (1 app)
% of scanner consensus on family name: only 6% of apps had consensus on the family name
% of scanner consensus on type: only 7% of apps had consensus on the type
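For readers who wish to draw a similar sample from AMD, the following is a minimal sketch of a family-stratified random sample using pandas. The metadata file and its column names are hypothetical and would need to be built from the AMD ground truth.

```python
import pandas as pd

# Hypothetical metadata file: one row per AMD app, with its hash and family label.
amd = pd.read_csv("amd_metadata.csv")  # assumed columns: sha256, family

SAMPLE_SIZE = 1250
fraction = SAMPLE_SIZE / len(amd)

# Sampling the same fraction from every family preserves the family distribution.
sample = (
    amd.groupby("family", group_keys=False)
       .apply(lambda group: group.sample(frac=fraction, random_state=42))
)

sample.to_csv("amd_sample.csv", index=False)
print(len(sample), "apps sampled across", sample["family"].nunique(), "families")
```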

A summary of the previously reported results:

| Dataset | Total Apps | Average VirusTotal Detections | Source(s) | Top Families | Top Types |
| --- | --- | --- | --- | --- | --- |
| Malgenome | 1234 | 31.4 | Various | Droidkungfu (38%), Basebridge (25%), Geinimi (5%) | Trojan (94%), Exploit (3%), Spyware (1%) |
| Piggybacking (malicious apps) | 1136 | 9 | Anzhi (64%), AppChina (12%), AnGeeks (5%) | Dowgin (24%), Kuguo (22%), Gingermaster (6%) | Adware (64%), Trojan (25%), Spyware (2%) |
| AMD | 204 (out of 1250) | 24.6 | Malgenome (29%), Google Play (27%), AppChina (23%) | Droidkungfu (20%), Airpush (8%), Ginmaster (8%) | Trojan (42%), Adware (34%), Exploit (11%) |

Repackaging Trends


Technically, it is quite straightforward to download an app, disassemble/decompile it, add some (malicious) code to its original code, re-assemble/re-compile it, sign it, and upload it to a (different) Android app marketplace. This technique is usually referred to as repackaging or piggybacking. It is considered an effective distribution method that tricks users into voluntarily infecting their devices by downloading and installing what appear to be legitimate apps (e.g., a new version of Angry Birds). In this experiment, we wanted to measure the percentage of apps in each dataset that adopt such a technique. To do so, we relied on a compiler fingerprinting tool, APKiD, to retrieve the compiler used to build an app. If the compiler is not dx or dexmerge, which are used by the Android SDK, we assume that the "developer" did not have access to the app's source code and had to rely on tools such as Apktool (and the compilers it utilizes, such as dexlib) to compile the app (probably during repackaging).
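As an illustration, the following is a minimal sketch of how this check could be automated by invoking the APKiD command-line tool and inspecting its JSON output. The exact schema of that output may differ across APKiD versions, so treat the parsing below as an assumption.

```python
import json
import subprocess

def repackaging_verdict(apk_path):
    """Run APKiD on an APK and guess whether it was repackaged.

    Assumes APKiD is installed (pip install apkid) and that its JSON output
    groups detections under a "compiler" key; the schema may vary by version.
    """
    result = subprocess.run(["apkid", "-j", apk_path],
                            capture_output=True, text=True, check=True)
    report = json.loads(result.stdout)

    compilers = set()
    for entry in report.get("files", []):
        compilers.update(entry.get("matches", {}).get("compiler", []))

    # dx/dexmerge come from the official Android SDK tool chain; anything else
    # (e.g., dexlib 1.x/2.x) hints at Apktool-style repackaging.
    sdk_compilers = {"dx", "dexmerge"}
    if compilers and compilers <= sdk_compilers:
        return "not repackaged"
    return "possibly repackaged"

print(repackaging_verdict("example.apk"))
```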

| Dataset | dx | dexmerge | Not Repackaged (dx + dexmerge) | Repackaged (dexlib 1.X + 2.X) |
| --- | --- | --- | --- | --- |
| Malgenome | 52% | -- | 52% | 48% |
| Play Store (benign) | 61% | 34% | 95% | 5% |
| Piggybacking (malicious) | 22% | 6% | 28% | 72% |
| Piggybacking (benign) | 61% | 22% | 83% | 17% |
| AMD | 38% | 35% | 73% | ~27% |


Detection Experiments

In this set of experiments, we wish to gain insights into how difficult it is to detect the malicious apps in the three datasets we focus on. Furthermore, we wish to investigate cross-dataset detection capability; in other words, we wish to study whether a detection method built on top of one dataset can recognize malicious instances from another, possibly newer, dataset. This gives us a rough estimate of the lifespan of a dataset, i.e., the period within which it can be used to train effective classifiers.

Our detection experiments are based on machine learning classifiers trained using the static and dynamic features listed on the information page. To obtain a comprehensive view of the performance of different classifiers, we utilize a voting classifier that combines K-Nearest Neighbors with different values of K={10, 25, 50, 100, 250, 500}, random forests with different numbers of trees E={10, 25, 50, 75, 100}, and a Support Vector Machine with a linear kernel.
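A minimal sketch of how such an ensemble can be put together with scikit-learn follows; any hyperparameters beyond those listed above (e.g., the random seed) are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# One estimator per configuration listed above, combined by majority (hard) voting.
estimators = []
for k in (10, 25, 50, 100, 250, 500):
    estimators.append((f"knn_{k}", KNeighborsClassifier(n_neighbors=k)))
for e in (10, 25, 50, 75, 100):
    estimators.append((f"forest_{e}", RandomForestClassifier(n_estimators=e, random_state=42)))
estimators.append(("svm_linear", SVC(kernel="linear")))

voting_clf = VotingClassifier(estimators=estimators, voting="hard")
# voting_clf.fit(X_train, y_train); predictions = voting_clf.predict(X_test)
```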

We encourage visitors of this website to reproduce our results using the feature vectors and Python scripts available on our downloads page.


Benchmark Classification


Prior to conducting the main experiments in this set, we trained and validated the voting classifier using 10-Fold cross-validation on each dataset individually. The table below summarizes the performance of our ensemble classifier on different datasets using static and dynamic features.

| Dataset | Accuracy (static / dynamic) | Recall (static / dynamic) | Precision (static / dynamic) | Specificity (static / dynamic) | F1 Score (static / dynamic) |
| --- | --- | --- | --- | --- | --- |
| Malgenome+GPlay | 0.98 / 0.94 | 0.97 / 0.94 | 0.99 / 0.74 | 0.99 / 0.94 | 0.98 / 0.83 |
| Piggybacking | 0.67 / 0.67 | 0.70 / 0.70 | 0.63 / 0.76 | 0.65 / 0.61 | 0.63 / 0.73 |
| AMD+GPlay | 0.94 / 0.87 | 0.92 / 0.87 | 0.96 / 0.86 | 0.96 / 0.87 | 0.94 / 0.86 |
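A rough sketch of how such a 10-fold cross-validation could be run with scikit-learn, reusing the voting_clf ensemble sketched above; X and y stand for the feature matrix and labels of one dataset, and the specificity scorer assumes malicious apps are labelled 1 and benign apps 0.

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

# Specificity is simply the recall of the benign (negative) class.
scoring = {
    "accuracy": "accuracy",
    "recall": "recall",
    "precision": "precision",
    "specificity": make_scorer(recall_score, pos_label=0),
    "f1": "f1",
}

scores = cross_validate(voting_clf, X, y, cv=10, scoring=scoring)
for name in scoring:
    print(name, round(scores[f"test_{name}"].mean(), 2))
```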

Visualization


In this set of experiments, we attempt to visually complement the results tabulated above by plotting the malicious and benign apps from the previous experiment in 2- and 3-dimensional scatter plots. One can notice a relationship between the performance of the ensemble classifier on a given dataset and how well separated the apps within that dataset are. In other words, we argue that the t-SNE representation of the apps in a lower-dimensional space hints at how easily the ensemble classifier can separate them in the original, higher-dimensional feature space.
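A minimal sketch of how such embeddings and figures could be produced with scikit-learn's t-SNE implementation and Plotly Express is shown below; the settings (number of components, random seed, colors) are illustrative assumptions rather than the exact ones we used.

```python
import plotly.express as px
from sklearn.manifold import TSNE

# X: feature matrix of one dataset; y: labels (1 = malicious, 0 = benign).
embedding = TSNE(n_components=3, random_state=42).fit_transform(X)

fig = px.scatter_3d(
    x=embedding[:, 0], y=embedding[:, 1], z=embedding[:, 2],
    color=["malicious" if label == 1 else "benign" for label in y],
    color_discrete_map={"malicious": "red", "benign": "blue"},
)
fig.write_html("tsne_3d.html")  # interactive figure, viewable in any browser
```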

The following table contains links to interactive HTML visualizations (2- and 3-dimensional) of malicious (in red) and benign (in blue) apps in the three datasets we experimented on earlier. The visualizations are scatter plots generated using the Python bindings of the Plotly graphing library. You can interact with the figures by zooming in/out and, for the 3-d figures, rotating the camera. Please feel free to download the figures.


| Dataset | Static features (2-d) | Static features (3-d) | Dynamic features (2-d) | Dynamic features (3-d) |
| --- | --- | --- | --- | --- |
| Malgenome+GPlay | view | view | view | view |
| Piggybacking | view | view | view | view |
| AMD+GPlay | view | view | view | view |

Most difficult to detect


Given the poor performance achieved by the voting classifier on the Piggybacking dataset, we wanted to gain more insights into the malware families and types that proved difficult for this classifier to detect. The following figures depict the average percentages (across the 10 folds) of malware families and types in the Piggybacking dataset that were misclassified by the aforementioned classifier:

Using static features

Using dynamic features
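The percentages in these figures can be approximated along the following lines. The sketch below pools predictions across the 10 folds (via cross_val_predict) rather than averaging per fold, and assumes a hypothetical families list holding each app's Euphony/VirusTotal family label.

```python
from collections import Counter

from sklearn.model_selection import cross_val_predict

# X, y: features and labels of the Piggybacking dataset;
# families: family label of each app, in the same order as X.
predictions = cross_val_predict(voting_clf, X, y, cv=10)

totals, missed = Counter(), Counter()
for family, truth, guess in zip(families, y, predictions):
    totals[family] += 1
    if truth != guess:
        missed[family] += 1

for family in sorted(totals, key=lambda f: missed[f] / totals[f], reverse=True):
    print(f"{family}: {100 * missed[family] / totals[family]:.1f}% misclassified")
```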

Compatibility Experiments


In this set of experiments, we trained the voting classifier using one dataset and tested it using another. Given the differences in the release dates of these datasets, each experiment gives us indications about the forward, lateral, or backward applicability of each dataset. For example, in the first row, the classifier is trained using malicious apps from the Malgenome dataset (released in 2012) and benign apps we gathered in 2017 from the Google Play marketplace; it is then tested using malicious and benign apps from the Piggybacking dataset (released around 2017). This experiment therefore tests the forward applicability of the Malgenome dataset. In other words, such experiments enable us to answer questions such as how much malicious apps have changed over the years, how comprehensive the training dataset is, and how long it takes for a dataset to become obsolete (i.e., unable to help classifiers recognize novel malware instances).
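For instance, the first row of the table below corresponds roughly to the following sketch; the feature matrices (X_malgenome, X_gplay, X_piggybacking) and labels are hypothetical names for data extracted with the same feature set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Train on Malgenome (malicious = 1) plus Google Play apps (benign = 0)...
X_train = np.vstack([X_malgenome, X_gplay])
y_train = np.concatenate([np.ones(len(X_malgenome)), np.zeros(len(X_gplay))])
voting_clf.fit(X_train, y_train)

# ...and test on the Piggybacking dataset, which contains both classes.
predictions = voting_clf.predict(X_piggybacking)
print("Forward applicability of Malgenome:", accuracy_score(y_piggybacking, predictions))
```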

| Training Dataset | Test Dataset | Accuracy (static / dynamic) | Recall (static / dynamic) | Precision (static / dynamic) | Specificity (static / dynamic) | F1 Score (static / dynamic) |
| --- | --- | --- | --- | --- | --- | --- |
| Malgenome+GPlay | Piggybacking | 0.49 / 0.52 | 0.49 / 0.65 | 0.54 / 0.42 | 0.47 / 0.44 | 0.51 / 0.51 |
| Malgenome+GPlay | AMD+GPlay | 0.90 / 0.79 | 0.96 / 0.93 | 0.83 / 0.60 | 0.86 / 0.73 | 0.90 / 0.73 |
| AMD+GPlay | Piggybacking | 0.50 / 0.59 | 0.50 / 0.63 | 0.75 / 0.73 | 0.48 / 0.78 | 0.60 / 0.69 |
| Piggybacking | AMD+GPlay | 0.47 / 0.63 | 0.47 / 0.57 | 0.48 / 0.86 | 0.48 / 0.78 | 0.47 / 0.69 |
| AMD+GPlay | Malgenome+GPlay | 0.97 / 0.93 | 0.95 / 0.92 | 0.99 / 0.93 | 0.99 / 0.94 | 0.97 / 0.92 |
| Piggybacking | Malgenome+GPlay | 0.51 / 0.63 | 0.51 / 0.55 | 0.34 / 0.94 | 0.51 / 0.89 | 0.40 / 0.70 |

Adversarial Experiments


This set of experiments builds on the observations and conclusions drawn from the experiments conducted in the previous sections. Despite the simplicity of the static and dynamic features we utilize in this paper, we were able to use them to train voting classifiers that perform well on the Malgenome and AMD datasets. Nonetheless, our classifiers severely underperformed on the Piggybacking dataset. In this context, we conducted two sets of experiments that attempt to answer two questions. Firstly, is repackaging the reason behind the difficulty of detecting apps in the Piggybacking dataset? Secondly, if so, how can malware authors leverage this fact to write evasive malware?

To answer these questions, we separated the malicious and benign subsets of the Piggybacking dataset and included them in the training and test datasets in a manner similar to that of the compatibility experiments. We refer to the malicious and benign subsets of the Piggybacking dataset as Piggybacked and Original, respectively.

| # | Training Dataset | Test Dataset | Accuracy (static / dynamic) |
| --- | --- | --- | --- |
| 1 | AMD+GPlay | Piggybacked | 0.81 / 0.72 |
| 2 | AMD+GPlay | Original | 0.20 / 0.38 |
| 3 | AMD+Original | Piggybacked | 0.17 / 0.50 |
| 4 | AMD+Original | Original | 0.98 / 0.94 |
| 5 | AMD+Malgenome+GPlay | Piggybacked | 0.81 / 0.79 |
| 6 | AMD+Malgenome+GPlay | Original | 0.20 / 0.30 |
| 7 | AMD+Original+GPlay | Piggybacked | 0.19 / 0.34 |
| 8 | AMD+Original+GPlay | Original | 0.98 / 0.98 |
| 9 | AMD+Malgenome+Original+GPlay | Piggybacked | 0.30 / 0.43 |
| 10 | AMD+Malgenome+Original+GPlay | Original | 0.91 / 0.92 |