Encyclopedia Malicia


During our experiments, we augmented some of the information provided by Euphony with information we gathered from malicious apps belonging to Malgenome, Piggybacking, and AMD (i.e., a total of more than 36,000 apps), to create an "encyclopedia" of malicious apps. We represent the gathered information as a CSV file, in which each app has the following fields:

sha256: The SHA256 hash of the APK
sha1: The SHA1 hash of the APK
md5: The MD5 hash of the APK
dex_date: The date/time on which the app was allegedly compiled
apk_size: The size (in KB) of the app's APK archive
pkg_name: The app's package name (e.g., com.my.app)
vercode: The app's version code
vt_detection: The number of antivirus scanners on VirusTotal that detect the app
vt_scan_date: The last date/time on which the app was scanned on VirusTotal
dex_size: The size (in KB) of the app's classes.dex file
markets: The marketplace on which the app was found
name: The app's malware family name (e.g., Gingermaster)
types: The app's malware type (e.g., Trojan)
multiple_names: Whether the app is given multiple family names by VirusTotal scanners
multiple_types: Whether the app is given multiple types by VirusTotal scanners
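
As a rough illustration of how the CSV could be queried, the sketch below counts apps per malware family, keeping only apps detected by a minimum number of VirusTotal scanners. The column names come from the list above; the sample rows, the function name family_counts, and the assumption that vt_detection holds a plain integer are ours, and the real file may be formatted differently.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; the real file has all 15 columns
# and its exact value formats may differ.
SAMPLE_CSV = """sha256,pkg_name,vt_detection,name,types
aaa,com.example.one,35,Gingermaster,Trojan
bbb,com.example.two,3,Gingermaster,Trojan
ccc,com.example.three,41,ExampleFamily,Adware
"""


def family_counts(csv_file, min_detections=0):
    """Count apps per family, keeping rows with at least min_detections."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Assumption: vt_detection is stored as a plain integer.
        if int(row["vt_detection"]) >= min_detections:
            counts[row["name"]] += 1
    return counts


counts = family_counts(io.StringIO(SAMPLE_CSV), min_detections=10)
print(counts.most_common())
```

To run this against the downloaded file, pass an open file handle (e.g., open(path, newline="")) instead of the in-memory sample.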

Feature vectors


We extracted two types of features, static and dynamic, from the APKs in the previously mentioned datasets. The static features, as their name suggests, were extracted from the APKs with the help of Androguard, without running the apps. The dynamic features were extracted from API call traces depicting the apps' runtime behavior: apps were deployed on virtual devices, interacted with using a homemade tool, Droidutan, and monitored using droidmon. You can find a breakdown of the features on our information page.

The downloadable zip archive is divided into five directories, each containing two directories (i.e., static and dynamic). The five directories contain feature vectors (organized as Python lists) of the following datasets:

amd: Features extracted from a sample (1,250 apps) of the AMD dataset
gplay16: Features extracted from a total of 1,1882 benign apps we downloaded from Google Play
malgenome: Features extracted from the apps belonging to the Malgenome dataset
original: Features extracted from the benign apps in the Piggybacking datasets
piggybacked: Features extracted from the repackaged versions of the original apps
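
The directory layout described above could be traversed along the following lines. This is only a sketch: the directory names are taken from the list, but the function load_vectors, the file name app_0001.txt, and the assumption that each file holds a single Python list literal are hypothetical, so check the actual file format in the archive before relying on it.

```python
import ast
import tempfile
from pathlib import Path

DATASETS = ("amd", "gplay16", "malgenome", "original", "piggybacked")
FEATURE_TYPES = ("static", "dynamic")


def load_vectors(root, dataset, feature_type):
    """Return {file name: feature vector} for one dataset/feature-type directory."""
    if dataset not in DATASETS or feature_type not in FEATURE_TYPES:
        raise ValueError(f"unknown directory: {dataset}/{feature_type}")
    vectors = {}
    for path in sorted(Path(root, dataset, feature_type).iterdir()):
        if path.is_file():
            # Assumption: each file contains one list literal, e.g. "[1.0, 0.0]".
            # ast.literal_eval parses it without executing arbitrary code.
            vectors[path.name] = ast.literal_eval(path.read_text().strip())
    return vectors


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as root:
    demo_dir = Path(root, "amd", "static")
    demo_dir.mkdir(parents=True)
    (demo_dir / "app_0001.txt").write_text("[1.0, 0.0, 12.5]")
    demo = load_vectors(root, "amd", "static")
print(demo)
```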

Scripts


Lastly, you can download a couple of Python scripts we used to run our classification experiments. Those scripts make use of some libraries included in Aion, a framework we are currently developing to analyze and detect Android malware; feel free to clone it and play with it.

compatibility.py: Used to conduct the forward-backward compatibility experiments
experiment.py: Used to conduct K-fold cross-validated classification on apps in a dataset and print statistics about the name/type of misclassified apps
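
For readers who want a feel for the kind of experiment described above before opening the scripts, here is a minimal, standard-library sketch of K-fold cross-validation that tallies the family names of misclassified apps. It does not reproduce experiment.py and does not use Aion; the function names, the toy threshold classifier, and the data are all hypothetical, and a real experiment would typically shuffle or stratify the folds.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def misclassified_families(X, y, families, k, fit_predict):
    """Run k-fold cross-validation and count family names of misclassified apps.

    fit_predict(X_train, y_train, X_test) is supplied by the caller and
    returns one predicted label per test vector.
    """
    errors = Counter()
    for train, test in k_fold_indices(len(X), k):
        predicted = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        for i, label in zip(test, predicted):
            if label != y[i]:
                errors[families[i]] += 1
    return errors


def threshold_classifier(X_train, y_train, X_test):
    """Toy classifier: label 1 if the first feature exceeds the training mean."""
    mean = sum(x[0] for x in X_train) / len(X_train)
    return [1 if x[0] > mean else 0 for x in X_test]


# Hypothetical data: one feature per app, label 1 = malicious, 0 = benign.
X = [[1], [9], [2], [8], [5], [10]]
y = [0, 1, 0, 1, 1, 1]
families = ["benign", "Gingermaster", "benign", "FamilyB", "Gingermaster", "FamilyB"]

errors = misclassified_families(X, y, families, k=3, fit_predict=threshold_classifier)
print(errors)
```

Passing the classifier as a callable keeps the fold logic independent of any particular learning library, so a scikit-learn or Aion-based model could be swapped in without touching the bookkeeping.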