Comparison of Classification Algorithms for Detection of Phishing Websites

Phishing activities remain a persistent security threat, with global losses exceeding 2.7 billion USD in 2018, according to the FBI's Internet Crime Complaint Center. In the literature, several generations of phishing websites detection methods can be observed. The oldest methods rely on manual blacklisting of known phishing websites' URLs in centralized databases, but they cannot detect newly launched phishing websites. More recent studies treat phishing websites detection as a supervised machine learning problem on phishing datasets built from features extracted from phishing websites' URLs. These studies show some classification algorithms performing better than others on differently designed datasets, but they do not single out the best classification algorithm for the phishing websites detection problem in general. The purpose of this research is to compare classic supervised machine learning algorithms on all publicly available phishing datasets with predefined features and to identify the best performing algorithm for phishing websites detection, regardless of a specific dataset design. Eight widely used classification algorithms were configured in Python using the Scikit Learn library and tested for classification accuracy on all publicly available phishing datasets. The algorithms were then ranked by accuracy on the different datasets using three ranking techniques, while the results were tested for statistically significant differences using Welch's T-test. The comparison results are presented in this paper, showing ensembles and neural networks outperforming other classical algorithms.


Introduction
Phishing is a form of cybercrime employing both social engineering and technical trickery to steal sensitive information, such as digital identity data, credit card data, login credentials, and other personal data, from unsuspecting users by masquerading as a trustworthy entity. For example, the victim receives an e-mail from an adversary with a threatening message, such as a possible bank or social media account termination or a fake alert on an illegal transaction (Lin Tan et al., 2016), directing the victim to a fraudulent website that mimics a legitimate one. The adversary can use any information that the victim enters on the phishing website to steal identity or money (Whittaker et al., 2010).
Although there are many existing anti-phishing solutions, phishers continue to lure more and more victims. In 2018, the Anti-Phishing Working Group (APWG) reported as many as 785,920 unique phishing websites detected, with a 69.5% increase during the last five years of monitoring, from 463,750 unique phishing websites detected in 2014 (Anti-Phishing Working Group, 2018). Global losses from phishing activities exceeded 2.7 billion USD in 2018, according to the FBI's Internet Crime Complaint Center (Internet Crime Complaint Center, 2019).
Deceptive phishing attacks are still so successful nowadays because, in essence, they are "human-to-human" assaults performed by professional adversaries who (i) have financial motivation for their actions, (ii) exploit lack of awareness and computer illiteracy of ordinary Internet users (Adebowale et al., 2019), and (iii) manage to learn from their previous experience and improve their future attacks to lure new victims more successfully. For this reason, ordinary Internet users cannot keep up with new trends of phishing attacks and learn to differentiate a legitimate website's URL from a malicious one, relying solely on their efforts.
In order to protect Internet users from criminal assaults, automated techniques for phishing websites detection began to be developed. The oldest approach included manual blacklisting of known phishing websites' URLs in centralized databases, later used by Internet browsers to alert users about possible threats. The drawback of the blacklisting method is that these databases do not include newly launched phishing websites and therefore do not protect against "zero hour" attacks, as most phishing URLs are inserted into centralized databases only about 12 hours after the first phishing attack (Jain and Gupta, 2018a). More recent studies have attempted to solve phishing websites detection as a supervised machine learning problem. Many authors have conducted experiments using various classification methods and different phishing datasets with predefined features (Chiew et al., 2019; Marchal et al., 2016; Sahoo et al., 2017).
The following open questions motivate our research:
1. State-of-the-art methods of phishing websites detection report classification accuracy (the classification accuracy measure is described in Section 3.4.1) well above 99.50% and use different classification algorithms: ensembles (Gradient Boosting) (Marchal et al., 2017), statistical models (Logistic Regression) (Whittaker et al., 2010), probabilistic algorithms (Bayesian Network) (Xiang et al., 2011), and classification trees (C4.5) (Cui et al., 2018). There is no common agreement on which classification algorithm is the most accurate for phishing website prediction on datasets with predefined features (Chiew et al., 2019).
2. State-of-the-art methods demonstrate these high classification accuracies on highly unbalanced datasets with minority and majority classes. The classification accuracy measure has low construct validity on datasets where the class balance is not proportional, as it shows better results for the majority class. Doubts remain whether these results are due to dataset-dependent method design (Chiew et al., 2019) or whether the algorithms used in state-of-the-art research are genuinely superior to others.
3. To the best of our knowledge, no studies comparing classic classification algorithms' performance on all publicly available phishing datasets with predefined features have been conducted to answer the questions mentioned above.
Therefore, the objective of this experimental research is to answer the research question: which classical classification algorithm is best for solving the phishing websites detection problem, on all publicly available datasets with predefined features?
In this paper we compare eight classic supervised machine learning algorithms of different types (for more details see Section 3.2) on three publicly available phishing datasets with predefined features being used by the scientific community in experiments with classification algorithms (for more details on datasets see Section 3.3).
We trained and tested all these algorithms upon all three datasets. Later we ranked these algorithms by their classification accuracy measure on different datasets using three different ranking techniques while testing the results for a statistically significant difference using Welch's T-Test.
The rest of the paper is organized as follows: In Section 2 we give a review of related work. In Section 3 we describe our research methodology. In Section 4 we report our experiment results. We conclude the paper in Section 5.

Related Works
The scientific community has spent a lot of effort to tackle the problem of phishing websites detection. In general, approaches to solving this problem can be grouped into three different categories: (i) blacklisting and heuristic-based approaches (more in Section 2.1), (ii) supervised machine learning approaches (more in Section 2.2), and (iii) deep learning approaches (more in Section 2.3) (Sahoo et al., 2017).

Review of Blacklisting and Heuristics-Based Research
Although there are initiatives to use centralized phishing websites' URL blacklisting solutions (e.g., PhishTank, 1 Google Safe Browsing API 2 ), this method has proven insufficient, as it takes time to detect and report a malicious URL while phishing websites have a very short lifespan (from a few hours to a few days) (Verma and Das, 2017). Therefore, new phishing URL detection methods began to be implemented by the scientific community.
Heuristic approaches are an improvement on blacklisting techniques where the signatures of common attacks are identified and blacklisted for the future use of Intrusion Detection Systems (Seifert et al., 2008). Heuristic methods supersede conventional blacklisting methods as they have better generalization capabilities and can detect threats in new URLs, but they cannot generalize to all types of new threats (Verma and Das, 2017).

Review of Supervised Machine Learning Based Research
During the last decade, most machine learning approaches to the phishing websites detection problem were based on supervised machine learning methods applied to phishing datasets with predefined features. In Table 1, we present a detailed summary of other authors' results on this problem during the last ten years of study. Our review consists of the publication year, authors, used classifier, dataset composition (numbers of phishing and legitimate websites), and achieved classification accuracy. Results are sorted by accuracy from highest to lowest.
From this review, we can make the following observations:
• The two best approaches scored as high as 99.9% accuracy.
• The most popular algorithms among researchers are Random Forest (8 papers), Naïve-Bayes (7 papers), SVM (7 papers), C4.5 (7 papers 3 ), and Logistic Regression (6 papers).
• The best 5 approaches scored above 99.49% and were implemented using different types of classifiers: neural networks, regression, decision trees, ensembles, and Bayesian. We see no prevailing classification method or type of method among the top results.
• The best 5 approaches use highly unbalanced datasets; therefore, evaluating classifier performance by accuracy is inadequate and does not tell how a classifier would perform on more balanced datasets.

Review of Deep Learning Based Research
During the past few years, novel approaches to solving the phishing websites detection problem using deep learning techniques were introduced by the scientific community. Zhao et al. demonstrated that a Gated Recurrent Unit (GRU) network, without the need for manual feature creation, is capable of classifying malicious URLs with 98.5% accuracy on 240,000 phishing and 150,000 legitimate website URL samples (Zhao et al., 2019). Saxe and Berlin performed an experiment with a Convolutional Neural Network (CNN), automating the process of feature design and extraction from generic raw character strings (malicious URLs, file paths, etc.) and reaching 99.30% accuracy on 19,067,879 randomly sampled examples.

Table 1. Classification approaches to the solution of the phishing websites detection problem.

Research Methodology
In this section, we describe our research methodology by defining the experimental design for our research (Section 3.1), the classification algorithms (Section 3.2), the datasets (Section 3.3), and the evaluation measures (Section 3.4). We discuss the validity of our results in Section 3.5.

Footnotes to Table 1:
4 C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan.
5 Classification And Regression Tree.
6 WEKA's class for generating a pruned or unpruned C4.5 decision tree.
7 A rule-based learner which combines C4.5 trees and RIPPER learning.
8 C5.0 is an algorithm used to generate a decision tree, developed by Ross Quinlan.
9 ID3 (Iterative Dichotomiser 3) is an algorithm used to generate a decision tree, developed by Ross Quinlan.
10 A propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), proposed by William W. Cohen.

Experimental Design
In this subsection, we present our experimental design, employed to perform the experiment, and answer the research question. The experiment was divided into three parts: (i) training the classifiers for each dataset, (ii) ranking the classifiers, and (iii) creating unified classifier ranking.

Part I: Training the Classifiers
The objective of this part is to train all the classifiers from Section 3.2 on all the datasets from Section 3.3 for their best possible classification accuracy, described in Section 3.4.1, formula (2). For every dataset and every classifier, we take the following steps:
1. Set up the classifier for a specific dataset in a Python environment using the Scikit Learn library (Pedregosa et al., 2011).
2. Manually select a set of hyper-parameters, referring to Scikit Learn's user guide.
3. Train and test the classifier using Scikit Learn's cross-validation (CV) function with 30 stratified folds.
4. Plot learning curves.
5. Analyse the learning curves and decide on tuning the hyper-parameters by answering the following questions:
• Is the algorithm learning from the training data or memorizing it? If the training curve is flat at 100%, the algorithm is not learning but memorizing the data. To solve this issue, we take actions such as reducing the number of weak learners in an ensemble, reducing the depth of the tree, or increasing the regularization parameter.
• Is the algorithm prone to overfitting (low bias, high variance) or underfitting (high bias, low variance), or does it learn "just right"? If the gap between the training and CV curves is small, the algorithm is underfitting; if the gap is big, it is overfitting. To solve this issue, we take actions to reduce high bias or high variance: (i) to reduce high variance, we add more training examples, use a smaller set of features, increase the regularization parameter, etc.; (ii) to reduce high bias, we use a bigger set of features, add polynomial features, increase the number of layers in the neural network, reduce the regularization parameter, etc.
If the decision is made to tune the hyper-parameters to avoid high bias or high variance, we start over from Step 2; if not, we proceed to Step 6.
6. Perform a Shapiro-Wilk test, as described in Section 3.4.3, formula (4), to check whether the classifier's classification accuracies from 30-fold CV testing are normally distributed. If not, take action to normalize the values.
7. Save the results for further actions.
We finish this part when all the classifiers are trained on all the datasets, and we have normally distributed sets of classification accuracies for each classifier on each dataset.
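The training and evaluation steps of Part I can be sketched as follows. This is an illustrative sketch only: the feature matrix is a synthetic stand-in generated with make_classification, and the classifier and hyper-parameters are placeholders for those selected per dataset in Section 3.2 and Table 2, not the authors' tuned configurations.

```python
# Sketch of Part I, steps 1-3 and 7: configure a classifier and evaluate it
# with 30 stratified cross-validation folds, scored by classification accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with a roughly balanced class distribution (hypothetical;
# the real experiment loads one of the three phishing datasets instead).
X, y = make_classification(n_samples=3000, n_features=30, random_state=0)

# Step 2: a manually chosen set of hyper-parameters (illustrative values).
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

# Step 3: 30 stratified folds, each fold's classification accuracy recorded.
cv = StratifiedKFold(n_splits=30, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# Step 7: the 30 per-fold accuracies are saved for ranking in Part II.
print(scores.mean(), scores.std())
```

Steps 4-5 (plotting and inspecting learning curves) are omitted here; Scikit Learn's learning_curve utility produces the data for them.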

Part II: Ranking the Classifiers
The objective of this part is to rank all the classifiers by their classification results within each individual dataset.
For every dataset, we take the following steps:
1. Using Welch's T-test, described in Section 3.4.2, formula (3), check every possible pair of classifiers for a statistically significant difference between their classification results produced in Part I. The classification results are normally distributed.
2. Arrange all classifiers by their mean classification accuracy in descending order.
3. Assign each classifier three ranks using the ranking techniques described in Section 3.4.4. Important notice: classifiers whose results have no statistically significant differences receive the same rank.
4. For each ranking technique, distribute points from the highest 10 to the lowest 1 for each classifier, depending on the received rank. Points are calculated using formula (1):

Points_i = N_methods − Rank_i + 1,     (1)

where
• N_methods is the number of algorithms participating in the ranking,
• Rank_i is the rank of the i-th algorithm.
5. Save the results for further actions.
We finish this part when all the classifiers receive ranks and ranking points by all ranking techniques on all datasets.
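The pairwise testing and point assignment above can be sketched as follows. This is a hypothetical illustration: the per-fold accuracy arrays are made up, only three classifiers are shown, and it assumes formula (1) awards N_methods − Rank_i + 1 points; scipy's ttest_ind with equal_var=False performs Welch's variant of the T-test.

```python
# Sketch of Part II for one dataset: pairwise Welch's T-tests, then ordering
# by mean accuracy and assigning points (highest rank gets the most points).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 30 per-fold CV accuracies per classifier (stand-in values, not real results).
results = {
    "RandomForest": rng.normal(0.970, 0.002, 30),
    "MLP": rng.normal(0.969, 0.002, 30),
    "NaiveBayes": rng.normal(0.900, 0.005, 30),
}

# Step 1: Welch's T-test (unequal variances) for every classifier pair.
names = list(results)
significant = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        _, p = stats.ttest_ind(results[a], results[b], equal_var=False)
        significant[(a, b)] = p < 0.05  # True means the means differ

# Steps 2 and 4: order by mean accuracy; assign N_methods - rank + 1 points.
means = sorted(results, key=lambda n: results[n].mean(), reverse=True)
n_methods = len(means)
for rank, name in enumerate(means, start=1):
    print(name, n_methods - rank + 1)
```

In the full procedure (step 3), classifiers whose pairwise test is not significant would share a tied rank before points are assigned.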

Part III: Creating the Unified Classifier Ranking
The objective of this part is to summarize the performance of selected classifiers on all datasets by creating a unified ranking. To do this, we combine rankings for each classifier by adding all the points received upon all datasets. Our experiment is complete after finishing this part.

Algorithms
In the review of supervised machine learning approaches in Section 2.2, we showed that the five best implementations employ different classifiers from separate types of supervised machine learning algorithms: neural networks, decision trees, ensembles, regression, and Bayesian. We also noted that the top 3 classifiers by popularity are Random Forest (8 papers), Naïve-Bayes (7 papers), and SVM (7 papers).
For our research, we built the set of algorithms consisting of:
• the three most popular algorithms from the review of related works (Section 2.2),
• five more algorithms from the Scikit Learn library, belonging to the best performing types of classifiers in the review of related works (Section 2.2).
Not all classical classification algorithms could be used, due to the limited resources available for this research. Therefore, in our experiment, we chose to use classic supervised machine learning algorithms: AdaBoost, Classification and Regression Tree, Gradient Tree Boosting, k-Nearest Neighbours, Multilayer Perceptron with backpropagation, Naïve-Bayes, Random Forest, and Support-Vector Machine.
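The eight algorithms above map directly onto Scikit Learn estimators, sketched below. All constructor arguments are library defaults; they are not the tuned hyper-parameters reported in Table 2, which differ per dataset.

```python
# The eight classifiers of the experiment as Scikit Learn estimators
# (default settings shown for illustration only).
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier   # MLP with backpropagation
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier    # CART

classifiers = {
    "AdaBoost": AdaBoostClassifier(),
    "Classification and Regression Tree": DecisionTreeClassifier(),
    "Gradient Tree Boosting": GradientBoostingClassifier(),
    "k-Nearest Neighbours": KNeighborsClassifier(),
    "Multilayer Perceptron": MLPClassifier(),
    "Naïve-Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "Support-Vector Machine": SVC(),
}
print(len(classifiers))
```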

Datasets
In our experiment, we used three publicly available phishing websites datasets with predefined features. To our knowledge, these are the only phishing datasets with predefined features made publicly available by other researchers.

Classification Accuracy
Classification accuracy in our experiment is the rate of phishing and legitimate websites which are identified correctly with respect to all the websites, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),     (2)

where
• TP is the number of websites correctly detected as phishing (True Positive),
• TN is the number of websites correctly detected as benign (True Negative),
• FP is the number of legitimate websites incorrectly detected as phishing (False Positive),
• FN is the number of phishing websites incorrectly detected as legitimate (False Negative).
We chose classification accuracy as our classification quality metric because: (i) most other researchers use classification accuracy to report their experimental results (see Section 2), so our results remain directly comparable with theirs; (ii) in our experiment we used datasets with equal or close to equal class distributions (there is no significant disparity between the number of positive and negative labels), so there are no majority and minority classes; (iii) we used a cross-validation function with the stratification option, which generates test sets containing the same distribution of classes, or as close as possible; (iv) we do not directly compare classification results across different datasets by accuracy and draw no conclusions from such comparisons; to distinguish top classifiers, we employ ranking techniques (see Section 3.4.4). In these circumstances, classification accuracy is a useful, unbiased measure.
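Formula (2) can be illustrated with a small helper on a hypothetical confusion matrix (the counts below are made up for the example).

```python
# Minimal illustration of formula (2).
def classification_accuracy(tp, tn, fp, fn):
    """Share of websites classified correctly, phishing and legitimate alike."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical example: 470 phishing and 480 legitimate websites classified
# correctly out of 1000 websites in total.
print(classification_accuracy(tp=470, tn=480, fp=20, fn=30))  # 0.95
```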

Welch's T-Test
Welch's T-test in our experiment is used to determine whether the means of classification accuracy results produced by any two classifiers within the same dataset have a statistically significant difference. The two-sample T-test for unpaired data is defined as follows (Snedecor and Cochran, 1989).
Let X_1, ..., X_n and Y_1, ..., Y_m be two independent samples from normal distributions, and let μ_x, μ_y be the means of these distributions. Then the hypothesis to be tested is

H_0: μ_x = μ_y  versus  H_1: μ_x ≠ μ_y.

The test statistic for testing the hypothesis is calculated as follows:

T = (X̄ − Ȳ) / sqrt(S_x²/n + S_y²/m),     (3)

where
• X̄ and Ȳ are the means of the samples X_1, ..., X_n and Y_1, ..., Y_m,
• n and m are the sizes of the samples X_1, ..., X_n and Y_1, ..., Y_m,
• S_x² and S_y² are the variances of the samples X_1, ..., X_n and Y_1, ..., Y_m.
We reject the null hypothesis H_0 that the two means are equal if |T| > t_{1−α/2,v}, where t_{1−α/2,v} is the critical value of the t distribution with v degrees of freedom (estimated by the Welch-Satterthwaite equation), at our chosen α = 0.05. Welch's T-test can only be performed on samples from normal distributions. We used the scipy.stats package for Python to perform the T-test.
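In scipy.stats, Welch's variant of the two-sample T-test is obtained by passing equal_var=False to ttest_ind. The accuracy samples below are hypothetical stand-ins for two classifiers' 30-fold CV results.

```python
# Welch's T-test on two made-up accuracy samples via scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.970, 0.003, 30)   # classifier A per-fold accuracies
y = rng.normal(0.955, 0.004, 30)   # classifier B per-fold accuracies

t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
# Reject H0 (equal means) at alpha = 0.05 when p_value < 0.05.
print(p_value < 0.05)
```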

Shapiro-Wilk Test
The Shapiro-Wilk test is used to check whether a sample came from a normally distributed population (Shapiro and Wilk, 1965). The test statistic is defined as follows:

W = (Σ_{i=1}^{n} a_i x_(i))² / Σ_{i=1}^{n} (x_i − x̄)²,     (4)

where
• x_(i) are the ordered sample values, x_(1) being the smallest,
• a_i are constants generated from the means, variances, and covariances of the order statistics of a sample of size n from a normal distribution,
• x̄ is the sample mean,
• n is the sample size.
We reject the null hypothesis H_0 that the sample belongs to a normal distribution if W < W_α, where W_α is the critical threshold. We used the scipy.stats package for Python to perform the Shapiro-Wilk test.
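The normality check of Part I, step 6 maps onto scipy.stats.shapiro, as sketched below on a hypothetical sample of 30 per-fold accuracies.

```python
# Shapiro-Wilk normality check via scipy.stats (illustrative sample values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
accuracies = rng.normal(0.96, 0.005, 30)   # stand-in 30-fold CV accuracies

w_stat, p_value = stats.shapiro(accuracies)
# H0 (sample comes from a normal distribution) is retained when p >= 0.05.
print(w_stat, p_value)
```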

Ranking Techniques
Ranking techniques used in our research are:
1. Standard Competition Ranking (SCR), where equal items get the same ranking number, and then a gap is left in the ranking numbers, i.e. "1224" ranking.
2. Dense Ranking (DR), where equal items get the same ranking number, and the next item gets the immediately following ranking number, i.e. "1223" ranking.
3. Fractional Ranking (FR), where equal items get the same ranking number, which is the mean of what they would have under ordinal ranking, i.e. "1 2.5 2.5 4" ranking.
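The three techniques map directly onto scipy.stats.rankdata methods: 'min' yields Standard Competition Ranking, 'dense' yields Dense Ranking, and 'average' yields Fractional Ranking. The accuracy scores below are illustrative; scores are negated so that higher accuracy receives a lower (better) rank.

```python
# The three ranking techniques applied to four illustrative accuracy scores
# (the middle two are tied).
from scipy.stats import rankdata

scores = [0.99, 0.97, 0.97, 0.95]
negated = [-s for s in scores]   # rank best (highest) accuracy as 1

scr = rankdata(negated, method="min")      # "1224" ranking
dr = rankdata(negated, method="dense")     # "1223" ranking
fr = rankdata(negated, method="average")   # "1 2.5 2.5 4" ranking

print(list(scr), list(dr), list(fr))
```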

Validity
In our experiment we used classification accuracy measure described in Section 3.4.1, formula (2) and balanced datasets (see Section 3.3). Classification accuracy has a high construct validity on balanced datasets. We used the cross-validation procedure with 30 stratified folds to evaluate classification accuracy, which provides an objective measure of how well the model fits and how well it will generalize to new data.
Welch's T-test was used to determine whether the means of classification accuracy results produced by any two classifiers within the same dataset have a statistically significant difference. This test eliminated the possibility of mis-ranking classifiers whose results had no statistically significant differences.
Three different ranking techniques were introduced to mitigate ranking bias in cases where distinct ranking techniques give different outcomes.
We provide the source code of our experiment to other researchers at https://github.com/PauliusVaitkevicius/Exp001.

Results
In this section we present our experiment results based on the research methodology described in Section 3.
First, we configured the selected classification algorithms (described in Section 3.2) for each dataset (described in Section 3.3). We used implementations of all selected algorithms from the Scikit Learn library (version 0.20.1) in Python (version 3.7.1), which provides open-source tools for data mining and data analysis (Pedregosa et al., 2011). We then chose the best fitting hyper-parameters for each algorithm on each dataset with 30-fold cross-validation, following our experimental design described in Section 3.1. The selected hyper-parameters for each classifier are listed in Table 2. They differ across datasets because the hyper-parameter selection technique described in Section 3.1, Part I, was applied to datasets with different designs and data quantities.
Subsequently, we trained and tested all the classifiers chosen for this experiment on all the datasets. We measured classification performance by accuracy: the ratio of phishing and legitimate URLs, which are classified correctly with respect to all the URLs in the dataset as described in Section 3.4.1, formula (2). Classification results are given in Table 3. Initial results showed that the Gradient Tree Boosting algorithm performed best on MDP-2018 and UCI-2016 datasets, and Multilayer Perceptron with backpropagation performed best on the UCI-2015 dataset.
Later, we evaluated all classification results against each other within an individual dataset using Welch's T-test, as described in Section 3.4.2, formula (3), to check if they have statistically significant differences. Afterward, we ordered all the classifiers by their performance upon each dataset using three different ranking techniques: SCR, FR, and DR, as described in Section 3.1. Classifiers, whose results had no statistically significant differences, were given equal ranks. Next, points from the highest 10 to the lowest 1 were distributed to each classifier depending on the assigned rank.
Ranking results for the UCI-2015 dataset are presented in Table 4, with Multilayer Perceptron ranking in the first place for all ranking techniques.
Results for the UCI-2016 dataset are presented in Table 5, showing Multilayer Perceptron, Gradient Tree Boosting, CART, and Random Forest all scoring maximum points, as their classification accuracy had no statistically significant difference.
And last, ranking results for the MDP-2018 dataset are presented in Table 6, with Gradient Tree Boosting, AdaBoost, and Random Forest all ranking in the first place for all ranking techniques.
Finally, combined rankings over all datasets were calculated in Table 7 by summing up all the points each classifier scored on each dataset; different sets of algorithms end up in the 1st place with different ranking techniques. Ranking the results with the Standard Competition Ranking technique puts Random Forest and Gradient Tree Boosting at the top. Ranking with the Fractional Ranking technique puts Multilayer Perceptron at the top. Ranking with the Dense Ranking technique puts Random Forest, Multilayer Perceptron, and Gradient Tree Boosting at the top. There is no single algorithm ranked at the top by all three ranking techniques.

Conclusions
In this paper, we provide an answer to our research question: which classical classification algorithm is best for solving the phishing websites detection problem on all publicly available datasets with predefined features? From our research, we draw the following conclusions:
1. Neural networks (in our case, Multilayer Perceptron) and ensemble-type algorithms (Random Forest, Gradient Tree Boosting, and AdaBoost) perform best for solving the phishing websites detection problem on the datasets used in the experiment.
2. Instance similarity-based and Bayesian classifiers, i.e. SVM, k-Nearest Neighbours, and Naïve-Bayes, perform the poorest for solving the phishing websites detection problem, regardless of a specific dataset design.
3. The results in conclusions #1 and #2 coincide with the general trends in the related works review (Section 2.2): the best classification results are achieved with neural networks, decision trees, and ensemble types of classification algorithms.
4. Classifiers showing above 99.0% classification accuracy on highly unbalanced datasets in the related works review (Section 2.2), i.e. Random Forest, SVM, Perceptron, and CART, did not score such high accuracy on any balanced dataset in our experiment.
In future work, hyper-parameter tuning can be automated using the Grid Search algorithm instead of manual expert hyper-parameter evaluation.

P. Vaitkevicius is a doctoral student at Vilnius University, Institute of Data Science and Digital Technologies. His research interests include machine learning, artificial intelligence, cybersecurity, and natural language processing.

V. Marcinkevicius received a doctoral degree in computer science (PhD) from Vytautas Magnus University in 2010. Since 2001 he has been an employee of Vilnius University, Institute of Data Science and Digital Technologies, where he is currently a senior researcher and the head of the Intelligent Technologies Research Group. His research interests include machine learning, artificial intelligence, cybersecurity, and natural language processing. He is the author of more than 70 scientific publications. He is a member of the Lithuanian Computer Society and the Lithuanian Mathematical Society.