Comparison of Classification Algorithms for Detection of Phishing Websites

Vaitkevicius, Paulius; Marcinkevicius, Virginijus

doi:10.15388/20-INFOR404

Informatica


Volume 31, Issue 1 (2020), pp. 143–160


Pub. online: 23 March 2020 Type: Research Article

Open Access

Received: 1 September 2019
Accepted: 1 January 2020
Published: 23 March 2020

Abstract

Phishing remains a persistent security threat, with global losses exceeding 2.7 billion USD in 2018, according to the FBI’s Internet Crime Complaint Center. The literature describes several generations of phishing website detection methods. The oldest methods rely on manually blacklisting the URLs of known phishing websites in a centralized database, so they cannot detect newly launched phishing websites. More recent studies treat phishing website detection as a supervised machine learning problem, using datasets built from features extracted from phishing websites’ URLs. These studies show that some classification algorithms perform better than others on differently designed datasets, but they do not identify the best classification algorithm for phishing website detection in general. The purpose of this research is to compare classic supervised machine learning algorithms on all publicly available phishing datasets with predefined features and to identify the best-performing algorithm for phishing website detection, regardless of a specific dataset design. Eight widely used classification algorithms were configured in Python using the scikit-learn library and tested for classification accuracy on these datasets. The algorithms were then ranked by accuracy on the different datasets using three ranking techniques, and the results were tested for statistically significant differences using Welch’s t-test. The comparison results presented in this paper show ensembles and neural networks outperforming the other classical algorithms.
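The evaluation procedure outlined in the abstract (train several scikit-learn classifiers, measure accuracy, rank them, and check differences with Welch's t-test) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic dataset from `make_classification` stands in for a real phishing dataset, only three classifiers are shown instead of eight, hyperparameters are library defaults, and ranking by mean accuracy is just one simple ranking rule.

```python
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a phishing dataset (assumption: not the paper's data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

# A small, illustrative subset of classifiers with default settings.
classifiers = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Per-fold accuracy for each classifier (10-fold cross-validation).
scores = {
    name: cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    for name, clf in classifiers.items()
}

# Rank classifiers by mean accuracy, best first.
ranking = sorted(scores, key=lambda name: scores[name].mean(), reverse=True)

for position, name in enumerate(ranking, start=1):
    print(f"{position}. {name}: mean accuracy = {scores[name].mean():.3f}")

# Welch's t-test (unequal variances) between the two top-ranked classifiers.
best, runner_up = ranking[0], ranking[1]
p_value = ttest_ind(scores[best], scores[runner_up], equal_var=False).pvalue
print(f"Welch's t-test, {best} vs {runner_up}: p = {p_value:.4f}")
```

Passing `equal_var=False` to `scipy.stats.ttest_ind` is what selects Welch's variant of the test. Cross-validation folds share training data, so the resulting p-value should be read as a rough indication only.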


Biographies

Vaitkevicius Paulius

paulius.vaitkevicius@mif.vu.lt

P. Vaitkevicius is a doctoral student at Vilnius University, Institute of Data Science and Digital Technologies. His research interests include machine learning, artificial intelligence, cybersecurity, and natural language processing.

Marcinkevicius Virginijus

V. Marcinkevicius received a doctoral degree (PhD) in computer science from Vytautas Magnus University in 2010. He has been employed at Vilnius University, Institute of Data Science and Digital Technologies, since 2001. He is currently a senior researcher and the head of the intelligent technologies research group at the same institute. His research interests include machine learning, artificial intelligence, cybersecurity, and natural language processing. He is the author of more than 70 scientific publications. He is a member of the Lithuanian Computer Society and the Lithuanian Mathematical Society.


Open access article under the CC BY license.

Keywords

phishing detection, classification algorithms, phishing datasets
