During our experiments, we augmented some of the information provided by Euphony with information we gathered from malicious apps belonging to Malgenome, Piggybacking, and AMD (i.e., a total of more than 36,000 apps), to create an "encyclopedia" of malicious apps. We represent the gathered information as a CSV file, in which each app has the following fields:
sha256 | The SHA256 hash of the APK |
sha1 | The SHA1 hash of the APK |
md5 | The MD5 hash of the APK |
dex_date | The date/time on which the app was allegedly compiled |
apk_size | The size (in KB) of the app's APK archive |
pkg_name | The app package name (e.g., com.my.app) |
vercode | The app's version code |
vt_detection | The number of VirusTotal antivirus scanners that detect the app |
vt_scan_date | The last date/time on which the app was scanned on VirusTotal |
dex_size | The size (in KB) of the app's classes.dex file |
markets | The marketplace on which the app was found |
name | The app's malware family name (e.g., Gingermaster) |
types | The app's malware type (e.g., Trojan) |
multiple_names | Whether the app is given multiple family names by VirusTotal scanners |
multiple_types | Whether the app is given multiple types by VirusTotal scanners |
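As a rough illustration of how these fields might be consumed, the sketch below reads records with Python's standard `csv` module and counts apps per malware family. It assumes the file is comma-delimited with a header row that uses the field names above and that `vt_detection` holds an integer; the helper names `load_encyclopedia` and `family_counts` and the two sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def load_encyclopedia(handle):
    """Read app records from an open CSV file into a list of dicts."""
    # Assumption: the first row is a header carrying the field names.
    return list(csv.DictReader(handle))


def family_counts(records, min_detections=1):
    """Count apps per family, keeping apps flagged by at least min_detections scanners."""
    counts = Counter()
    for row in records:
        if int(row["vt_detection"]) >= min_detections:
            counts[row["name"]] += 1
    return counts


# Two invented rows standing in for the real file (only a few fields shown).
sample = io.StringIO(
    "sha256,pkg_name,vt_detection,name,types\n"
    "aaa,com.my.app,12,Gingermaster,Trojan\n"
    "bbb,com.other.app,0,Unknown,Unknown\n"
)
records = load_encyclopedia(sample)
print(family_counts(records))
```

With the real file, the `io.StringIO` object would be replaced by `open(path, newline="")`, where `path` points at the downloaded CSV.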
We extracted two types of features, static and dynamic, from the APKs in the aforementioned datasets. The static features, as their name suggests, were extracted with Androguard from the APKs without running them. The dynamic features were extracted from API call traces depicting the apps' runtime behavior: apps were deployed on virtual devices, interacted with using a homemade tool, Droidutan, and monitored using droidmon. You can find a breakdown of the features under our information page.
The downloadable zip archive is divided into five directories, each containing two subdirectories (static and dynamic). The five directories contain feature vectors (organized as Python lists) of the following datasets:
amd | Features extracted from a sample (1,250 apps) of the AMD dataset |
gplay16 | Features extracted from a total of 1,1882 benign apps we downloaded from Google Play |
malgenome | Features extracted from the apps belonging to the Malgenome dataset |
original | Features extracted from the benign apps in the Piggybacking datasets |
piggybacked | Features extracted from the repackaged versions of the original apps |
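The directory layout above could be traversed along the following lines. This is only a sketch under stated assumptions: it supposes that each file inside a `static` or `dynamic` directory contains exactly one Python list literal of numbers, which is one reading of "organized as Python lists"; the file name `app1.static` and the helper `load_feature_vectors` are hypothetical.

```python
import ast
import os
import tempfile


def load_feature_vectors(root, dataset, kind):
    """Load every feature vector stored under <root>/<dataset>/<kind>/."""
    folder = os.path.join(root, dataset, kind)
    vectors = {}
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if not os.path.isfile(path):
            continue
        with open(path) as handle:
            # Assumption: the file holds a single list literal, e.g. [1, 0, 3.5].
            # ast.literal_eval parses literals only and never executes code.
            vector = ast.literal_eval(handle.read().strip())
        vectors[filename] = [float(value) for value in vector]
    return vectors


# Demo on a throwaway directory that mimics the archive layout.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "amd", "static"))
    with open(os.path.join(root, "amd", "static", "app1.static"), "w") as handle:
        handle.write("[1, 0, 3.5]")
    demo_vectors = load_feature_vectors(root, "amd", "static")

print(demo_vectors)
```

Using `ast.literal_eval` rather than `eval` keeps the loader safe even if a file in the archive is malformed.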
Lastly, you can download a couple of Python scripts we used to run our classification experiments. Those scripts make use of some libraries included in Aion, a framework we are currently developing to analyze and detect Android malware; feel free to clone it and play with it.
compatibility.py | Used to conduct the forward-backward compatibility experiments |
experiment.py | Used to conduct K-fold cross-validated classification on apps in a dataset and print statistics about the name/type of misclassified apps |
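To give a feel for what K-fold cross-validation with per-family error statistics involves, here is a minimal, standard-library-only sketch. It is not the code in `experiment.py` and does not use Aion; the nearest-neighbor classifier is a stand-in chosen for brevity, the folds are contiguous and unshuffled, and all function names and toy data are hypothetical.

```python
from collections import Counter


def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, roughly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def nearest_neighbor(train_x, train_y, sample):
    """Stand-in classifier: return the label of the closest training vector."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, sample)) for row in train_x]
    return train_y[distances.index(min(distances))]


def cross_validate(vectors, labels, families, classify, k):
    """Run k-fold cross-validation and count misclassified apps per family."""
    errors = Counter()
    for train, test in kfold_indices(len(vectors), k):
        train_x = [vectors[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            if classify(train_x, train_y, vectors[i]) != labels[i]:
                errors[families[i]] += 1
    return errors


# Toy data: label 0 = benign, 1 = malicious; the third app sits near the benign ones.
vectors = [[0], [1], [3], [10], [9]]
labels = [0, 0, 1, 1, 1]
families = ["benign", "benign", "Gingermaster", "Gingermaster", "Gingermaster"]

errors = cross_validate(vectors, labels, families, nearest_neighbor, k=5)
print(errors)
```

In practice the folds would normally be shuffled or stratified first so that each fold contains both benign and malicious apps.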