The duplication issue within the Drebin dataset

Paul Irolla; Alexandre Dey

doi:10.1007/s11416-018-0316-z

Source

Journal of Computer Virology and Hacking Techniques > 2018 > 14 > 3 > 245-249

Abstract

The Drebin dataset (in: NDSS, 2014) is the most widely distributed academic dataset of Android malware, and therefore the most used dataset in research papers on Android malware detection. The research community uses it to evaluate and compare algorithms. We discovered that 49.35% of the samples in this dataset have at least one other sample that is a repackaged version containing exactly the same sequence of opcodes. In all cases, the only differences between the original malware and its duplicates are the embedded resources and some strings in the code. When part of the dataset is held out to assess the performance of malware detectors or classifiers, a major part of the testing set ends up being the same samples that were used in the training set. This situation can lead us, the research community, to overrate the performance of the algorithms we design. In the worst case, it leads us to wrong conclusions and wrong directions for future research. We then conduct an experiment in which we test several classification algorithms on the Drebin dataset with and without the duplicates. Our results show that, depending on the classifier, using the full dataset leads to inaccuracy being underrated moderately (124%) to strongly (172%), and that the performance ranking of the algorithms changes. Finally, we provide the list of unique malware samples from the Drebin dataset, available on GitHub.
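
The duplicate criterion described in the abstract (two samples sharing exactly the same opcode sequence) can be illustrated with a minimal sketch. This is not the authors' tooling: the function names, the toy sample identifiers, and the assumption that each sample's opcode sequence has already been extracted into a list of mnemonics are all hypothetical.

```python
import hashlib
from collections import defaultdict


def opcode_fingerprint(opcodes):
    """Hash an opcode sequence; identical sequences give identical digests."""
    digest = hashlib.sha256()
    for op in opcodes:
        digest.update(op.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def group_duplicates(samples):
    """Group sample ids whose opcode sequences are exactly the same.

    `samples` is assumed to map a sample id to its list of opcode mnemonics.
    """
    groups = defaultdict(list)
    for sample_id, opcodes in samples.items():
        groups[opcode_fingerprint(opcodes)].append(sample_id)
    return list(groups.values())


def unique_representatives(samples):
    """Keep one sample id per duplicate group (the lexicographically smallest)."""
    return sorted(min(group) for group in group_duplicates(samples))


def duplicated_fraction(samples):
    """Fraction of samples that share their opcode sequence with another sample."""
    if not samples:
        return 0.0
    groups = group_duplicates(samples)
    duplicated = sum(len(group) for group in groups if len(group) > 1)
    return duplicated / len(samples)


# Hypothetical toy data, not taken from Drebin.
toy = {
    "a.apk": ["const", "invoke", "return"],
    "b.apk": ["const", "invoke", "return"],  # same opcodes as a.apk
    "c.apk": ["move", "return"],
}
```

Deduplicating before the train/test split, or keeping every member of a duplicate group on the same side of the split, prevents a repackaged sample from appearing in both the training and the testing set.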

Identifiers

Journal e-ISSN: 2263-8733
DOI: 10.1007/s11416-018-0316-z

Authors

Paul Irolla

École d’ingénieurs du monde numérique (ESIEA), Laboratoire de cryptologie et virologie opérationnelles (CVOLab), Laval, France

Alexandre Dey

École d’ingénieurs du monde numérique (ESIEA), Laval, France

Keywords

Android; Malware detection; Machine learning; Dataset

Additional information

Copyright holder: Springer-Verlag France SAS, part of Springer Nature, 2018

Publication language: English

Data collection: Springer

Publisher

Springer Paris

Document type: article