<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR457</article-id>
<article-id pub-id-type="doi">10.15388/21-INFOR457</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Study of Multi-Class Classification Algorithms’ Performance on Highly Imbalanced Network Intrusion Datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-8331-4352</contrib-id>
<name><surname>Bulavas</surname><given-names>Viktoras</given-names></name><email xlink:href="viktoras.bulavas@itpc.vu.lt">viktoras.bulavas@itpc.vu.lt</email><xref ref-type="aff" rid="j_infor457_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>V. Bulavas</bold> is a data privacy and information security officer at Vilnius University. His research interests include machine learning, information security and privacy. His academic background includes an MSc in physics from Vilnius University and an MSc in public management from the Norwegian School of Management. He is certified with CISA, CGEIT, CRISC and CSM.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-2281-4035</contrib-id>
<name><surname>Marcinkevičius</surname><given-names>Virginijus</given-names></name><email xlink:href="virginijus.marcinkevicius@mif.vu.lt">virginijus.marcinkevicius@mif.vu.lt</email><xref ref-type="aff" rid="j_infor457_aff_001">1</xref><bio>
<p><bold>V. Marcinkevičius</bold> is a senior researcher, head of the Intelligent Technologies Research Group, and head of the Artificial Intelligence Laboratory at Vilnius University, Institute of Data Science and Digital Technologies. His research interests include machine learning, information security and natural language processing. His academic background includes an MSc in mathematics from Vilnius Educational University and a PhD in informatics from Vytautas Magnus University.</p></bio>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-2266-0088</contrib-id>
<name><surname>Rumiński</surname><given-names>Jacek</given-names></name><email xlink:href="jacek.ruminski@pg.edu.pl">jacek.ruminski@pg.edu.pl</email><xref ref-type="aff" rid="j_infor457_aff_002">2</xref><bio>
<p><bold>J. Rumiński</bold> is a professor at Gdańsk University of Technology, head of the Department of Biomedical Engineering, and head of the Gdańsk AI Bay club. His research interests include biomedical engineering and information security. His academic background includes an MSc in medical devices, a PhD in healthcare informatics, and a habilitation in biocybernetics and biomedical engineering from Gdańsk University of Technology.</p></bio>
</contrib>
<aff id="j_infor457_aff_001"><label>1</label>Institute of Data Science and Digital Technologies, <institution>Vilnius University</institution>, Akademijos str. 4, LT-08663 Vilnius, <country>Lithuania</country></aff>
<aff id="j_infor457_aff_002"><label>2</label>Faculty of Electronics, Telecommunications and Informatics, <institution>Gdańsk University of Technology</institution>, 11/12 Gabriela Narutowicza, 80-233 Gdańsk, <country>Poland</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2021</year></pub-date><pub-date pub-type="epub"><day>7</day><month>9</month><year>2021</year></pub-date><volume>32</volume><issue>3</issue><fpage>441</fpage><lpage>475</lpage><history><date date-type="received"><month>3</month><year>2021</year></date><date date-type="accepted"><month>7</month><year>2021</year></date></history>
<permissions><copyright-statement>© 2021 Vilnius University</copyright-statement><copyright-year>2021</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>This paper is devoted to the problem of class imbalance in machine learning, focusing on the detection of rare intrusion classes in computer networks. The class imbalance problem occurs when one class heavily outnumbers the examples of the other classes. In this paper, we are particularly interested in classifiers, as pattern recognition and anomaly detection can be solved as classification problems. Because the major part of the network traffic of any organization is benign while malignant traffic is rare, researchers have to deal with a class imbalance problem. Substantial research has been undertaken to identify methods or data features that allow these attacks to be identified accurately. The usual tactic for dealing with the class imbalance problem, however, is to label all malignant traffic as one class and then solve the resulting binary classification problem. In this paper, we choose not to group or drop rare classes, but instead investigate what can be done to achieve good multi-class classification performance. Rare class records were up-sampled using the SMOTE method (Chawla <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_007">2002</xref>) to preset ratio targets. Experiments with three network traffic datasets, namely CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) and LITNET-2020 (Damasevicius <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>), were performed, aiming to achieve reliable recognition of the rare malignant classes available in these datasets.</p>
<p>Popular machine learning algorithms were chosen to compare their readiness to support rare class detection. The related algorithm hyperparameters were tuned within a wide range of values, different data feature selection methods were used, and tests were executed with and without over-sampling to assess multi-class classification performance on rare classes.</p>
<p>Machine learning algorithms ranking based on <italic>Precision</italic>, <italic>Balanced Accuracy Score</italic>, <inline-formula id="j_infor457_ineq_001"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, and prediction error <italic>Bias and Variance decomposition</italic>, show that decision tree ensembles (<italic>Adaboost, Random Forest Trees and Gradient Boosting Classifier</italic>) performed best on the network intrusion datasets used in this research.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>network intrusion detection</kwd>
<kwd>multi-class classification</kwd>
<kwd>imbalanced learning</kwd>
<kwd>bias and variance decomposition</kwd>
<kwd>SMOTE</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_infor457_s_001">
<label>1</label>
<title>Introduction</title>
<p>Detection of intrusions into networks, information systems or workstations, as well as detection of malware and unauthorized activities of individuals, has emerged as a global challenge. A part of the cyber defence challenge is addressed by optimizing intrusion detection systems (IDS). There are three methods of intrusion detection (Koch, <xref ref-type="bibr" rid="j_infor457_ref_032">2011</xref>): known pattern recognition (signature-based), anomaly-based detection, and a hybrid of the two. Anomaly-based detection is currently implemented mainly as a support for zero-day network perimeter defence of large infrastructures and network operators, while signature-based intrusion prevention remains the main mode of defence for most businesses and households. Pattern recognition or anomaly detection can be seen as classification problems, i.e. problems in which the variable to be predicted is categorical. In network traffic, benign data is most often represented by a large number of examples, while malignant traffic appears extremely rarely or is an absolute rarity. This is known as the class imbalance problem and is a known obstacle to the induction of good classifiers by machine learning (ML) algorithms (Batista <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_002">2004</xref>).</p>
<p>He and Ma (<xref ref-type="bibr" rid="j_infor457_ref_025">2013</xref>) define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews, aimed at developing effective decision boundaries to support the decision-making process. He and Ma (<xref ref-type="bibr" rid="j_infor457_ref_025">2013</xref>) also introduced informal conventions for imbalanced dataset classification. A dataset in which the most common class is less than twice as common as the rarest class is marginally imbalanced. A dataset with an imbalance ratio of about 10 : 1 is modestly imbalanced, and a dataset with an imbalance ratio above 1000 : 1 is extremely imbalanced. This sort of imbalance is found in medical record databases regarding rare diseases, or in the production of electronic equipment, where non-faulty examples heavily outnumber faulty ones. Cases where the negative-to-positive ratio is close to or higher than 1 000 000 : 1 are called absolute rarity. This sort of imbalance is found in cyber security, where all but a few network traffic flows are benign. However, standard ML algorithms are still capable of inducing good classifiers for extremely imbalanced training sets, which shows that class imbalance is not the only problem responsible for the decreased performance of learning algorithms. Batista <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_002">2004</xref>) have demonstrated that part of the difficulty in separating classes is often an overlap of classes due to a lack of feature separation. Another reason can be a lack of attributes specific to a certain decision boundary. It is known that when the negative class has an internal structure (a multimodal class), an overlap between the negative and positive classes can be observed in a few of the clusters within the negative class.</p>
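The imbalance conventions above (together with the category labels used later in Table 3) can be expressed as a small, illustrative helper; the function name and exact cut-offs are ours, chosen to match the thresholds described in the text:

```python
from collections import Counter

def imbalance_category(labels):
    """Classify a labelled sample by its majority-to-minority class ratio,
    following the informal conventions of He and Ma (2013):
    marginal (< 2:1), modest (< 10:1), high (< 1000:1),
    extreme (> 1000:1), and absolute rarity (around 1 000 000:1)."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    if ratio < 2:
        return "marginal"
    if ratio < 10:
        return "modest"
    if ratio < 1000:
        return "high"
    if ratio < 1_000_000:
        return "extreme"
    return "absolute rarity"

# 9 benign flows for every malignant one -> modest imbalance
print(imbalance_category(["benign"] * 9000 + ["attack"] * 1000))  # modest
```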
<p>This study reports the results of empirical research performed with selected supervised machine learning classification algorithms, in an attempt to compare their efficiency for intrusion detection and to improve on results from other published studies. The study is organized as follows: Section <xref rid="j_infor457_s_003">2</xref> introduces the data sources; Section <xref rid="j_infor457_s_011">3</xref> reviews the machine learning methods and model benchmark metrics used in this study; Section <xref rid="j_infor457_s_030">4</xref> gives an overview of the experiment and pre-processing steps; Section <xref rid="j_infor457_s_036">5</xref> presents results and conclusions.</p>
<sec id="j_infor457_s_002">
<label>1.1</label>
<title>Contribution</title>
<p>The research question raised in this study is <italic>which supervised machine learning method consistently provides the best multi-class classification results with large and highly imbalanced network datasets</italic>. To answer this question, we chose the CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) and LITNET-2020 (Damasevicius <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>) datasets, as they are recent, realistic, software-generated network traffic datasets and meet the criteria (Gharib <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>) for a good network intrusion dataset. The answer is that, based on rankings of performance metrics and bias-variance decomposition, the tree ensembles <italic>Adaboost, Random Forest Trees and Gradient Boosting Classifier</italic> performed best on the network intrusion datasets used in this research.</p>
<p>The novelty of this research lies in the proposed methodology (see Section <xref rid="j_infor457_s_030">4</xref>) and its application to the recent and not yet thoroughly studied LITNET-2020 dataset. A review of the LITNET-2020 dataset's compliance with the criteria raised by Gharib <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>) is first introduced in Section <xref rid="j_infor457_s_005">2.2</xref>. A variant of random under-sampling (skewed ratio under-sampling, proposed by the authors and discussed in Section <xref rid="j_infor457_s_012">3.1</xref>) is used to reduce class imbalance in a nonlinear fashion. SMOTE up-sampling for numeric data and SMOTE-NC for categorical data (see Section <xref rid="j_infor457_s_013">3.2</xref>) are executed to increase the representation of rare classes. Further in this research, a comparison of the multi-class classification performance on the CIC-IDS2017 and CIC-IDS2018 datasets with that on the LITNET-2020 dataset is discussed in Section <xref rid="j_infor457_s_036">5</xref>. Multi-class <italic>macro-averaged</italic> performance metrics are implemented in this research. Balanced accuracy (Formula (<xref rid="j_infor457_eq_002">2</xref>)) and the geometric mean of recall (Formula (<xref rid="j_infor457_eq_004">4</xref>)) are implemented for the LITNET-2020 dataset for the first time (see results in Tables <xref rid="j_infor457_tab_016">16</xref> and <xref rid="j_infor457_tab_017">17</xref>). Multi-criteria scoring is cross-validated by testing on data previously unseen by the models (see Section <xref rid="j_infor457_s_030">4</xref>). 
For decision tree ensemble methods, instead of the weak <italic>CART</italic> base classifiers, the parameters <italic>tree depth</italic> and <italic>alpha</italic> were tuned via grid search and validated using the method of maximum cost path analysis (Breiman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>), see Section <xref rid="j_infor457_s_027">3.8</xref>. An additional ML model, the Gradient Boosting Classifier, utilizing an ensemble of classification and regression trees (CART), was introduced as a benchmark in this research via the <italic>XGBoost</italic> library (Chen and Guestrin, <xref ref-type="bibr" rid="j_infor457_ref_008">2016</xref>) with GPU support (see Section <xref rid="j_infor457_s_022">3.5.6</xref>). In our methodology, due to the highly imbalanced nature of the data used, cost-sensitive method implementations were chosen. These choices led to better results (see Table <xref rid="j_infor457_tab_020">20</xref>) compared to other reviewed studies. Furthermore, the selection of models with better generalization capabilities in this research is achieved through decomposition of the classification error into bias and variance (see results in Table <xref rid="j_infor457_tab_018">18</xref>).</p>
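The SMOTE up-sampling used in the methodology can be sketched in a few lines: each synthetic record interpolates between a minority-class point and one of its nearest minority-class neighbours (after Chawla <italic>et al.</italic>, 2002). This is a simplified, numeric-only illustration with a function name of our choosing; library implementations such as imbalanced-learn's SMOTE and the SMOTE-NC variant for categorical features add edge-case handling omitted here:

```python
import numpy as np

def smote_upsample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by random linear
    interpolation between each chosen point and one of its k nearest
    minority-class neighbours (simplified SMOTE sketch)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                       # pick a minority sample
        nb = X_min[rng.choice(neighbours[j])]     # and one of its neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# up-sample a rare class of 20 records with 80 synthetic ones
rare = np.random.default_rng(0).normal(size=(20, 4))
new = smote_upsample(rare, 80, rng=1)
print(new.shape)  # (80, 4)
```

Because each synthetic point is a convex combination of two real minority points, the new records stay inside the minority class region rather than duplicating existing rows, which is what distinguishes SMOTE from plain random over-sampling.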
</sec>
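The macro-averaged metrics named above can be computed directly from per-class recalls: balanced accuracy is their arithmetic mean, and the geometric mean of recall penalizes a classifier heavily for missing any single rare class. A minimal numpy sketch (function names are ours, illustrating the general definitions rather than the paper's Formulas (2) and (4) verbatim):

```python
import numpy as np

def per_class_recall(y_true, y_pred, classes):
    """Recall of each class: correctly predicted / actual members."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.array([np.mean(y_pred[y_true == c] == c) for c in classes])

def balanced_accuracy(y_true, y_pred, classes):
    # arithmetic mean of per-class recalls (macro-averaged recall)
    return float(per_class_recall(y_true, y_pred, classes).mean())

def g_mean_recall(y_true, y_pred, classes):
    # geometric mean of per-class recalls; one badly missed rare
    # class drives the score towards zero
    r = per_class_recall(y_true, y_pred, classes)
    return float(np.prod(r) ** (1.0 / len(r)))

# toy example: the rare "dos" class is only half recognized
y_true = ["benign"] * 8 + ["dos"] * 2
y_pred = ["benign"] * 8 + ["dos", "benign"]
print(balanced_accuracy(y_true, y_pred, ["benign", "dos"]))  # 0.75
print(g_mean_recall(y_true, y_pred, ["benign", "dos"]))      # ~0.7071
```

Note that plain accuracy on this toy example would be 0.9, hiding the fact that half of the rare class was missed; the macro-averaged metrics expose it.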
</sec>
<sec id="j_infor457_s_003">
<label>2</label>
<title>Datasets Used</title>
<p>The following section presents a review of datasets considered for this research together with arguments for the choice made.</p>
<sec id="j_infor457_s_004">
<label>2.1</label>
<title>Datasets Considered for Analysis</title>
<p>There are many datasets that have been used by researchers to evaluate the performance of their proposed intrusion detection and intrusion prevention approaches. Far from being complete, the list includes: DARPA 1998 (Lippmann <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_042">1999</xref>) and 1999 traces by Lincoln Laboratory, USA, KDD’99 (Hettich and Bay, <xref ref-type="bibr" rid="j_infor457_ref_026">1999</xref>), CAIDA (The Cooperative Association for Internet Data Analysis, <xref ref-type="bibr" rid="j_infor457_ref_065">2010</xref>) datasets by the University of California, USA, the Internet Traffic Archive and LBNL traces by Lawrence Berkeley National Laboratory, USA (Lawrence Berkeley National Laboratory, <xref ref-type="bibr" rid="j_infor457_ref_038">2010</xref>), DEFCON by The Shmoo Group (<xref ref-type="bibr" rid="j_infor457_ref_066">2011</xref>), ISCX IDS 2012 (Shiravi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_060">2012</xref>), CIDDS-001 (Coburg Intrusion Detection Data Set) (Ring <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_051">2017</xref>) and others. However, it has been widely acknowledged that machine learning research in the intrusion detection area needs to include new attack types, and researchers should therefore consider more recent data sources.</p>
<p>In this research, three recent network datasets, compliant with the criteria described further below (see Section <xref rid="j_infor457_s_005">2.2</xref>) and suggested by their authors for intrusion detection research, are explored. The datasets chosen are CIC-IDS2017 and CSE-CIC-IDS2018 (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) by the University of New Brunswick, Canada, and LITNET-2020 (Damasevicius <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>). These datasets are of significant volume, contain anonymized real academic network traffic, and are suited for multiple machine learning purposes. LITNET-2020 is a new dataset that is given particular attention in this research, with a discussion of its compliance with the dataset suitability criteria devised by Gharib <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>).</p>
</sec>
<sec id="j_infor457_s_005">
<label>2.2</label>
<title>Requirements for Cybersecurity Datasets</title>
<p>Criteria for building such datasets are discussed by Małowidzki <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_044">2015</xref>), Buczak and Guven (<xref ref-type="bibr" rid="j_infor457_ref_006">2016</xref>), Maciá-Fernández <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_043">2018</xref>), Ring <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_052">2019</xref>), Damasevicius <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>), and others.</p>
<p>Małowidzki <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_044">2015</xref>) define the following features of a good dataset: it must contain recent data, be realistic, contain all typical attacks met in the wild, be labelled, be correct regarding operating cycles in enterprises (working hours), and should be flow-based. Ring <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_052">2019</xref>) contend that a good dataset should be comparable with real traffic and therefore have more normal than malicious traffic, since most of the traffic within a company is normal and only a small part is malicious. A detailed framework and analysis of criteria for such datasets was proposed by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. Gharib <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>) proposed eleven dataset selection criteria. These criteria are presented in Table <xref rid="j_infor457_tab_001">1</xref>. Following the publication of the criteria, CIC created a list of new datasets,<xref ref-type="fn" rid="j_infor457_fn_001">1</xref><fn id="j_infor457_fn_001"><label><sup>1</sup></label>
<p>See <uri>https://www.unb.ca/cic/datasets/index.html</uri>.</p></fn> addressing issues of compliance with these criteria. The creation of CSE-CIC-IDS2018 followed, with improvements such as a reduced number of duplicates and uncertainties. Thakkar and Lohiya (<xref ref-type="bibr" rid="j_infor457_ref_063">2020</xref>) in Sections 4.1 and 4.2, Tables 4 and 5, and Karatas <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>) in Sections III.C (CIC-IDS2017) and III.D (CSE-CIC-IDS2018) provide discussion supporting these claims.</p>
<table-wrap id="j_infor457_tab_001">
<label>Table 1</label>
<caption>
<p>Dataset compliance criteria by Gharib <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">No.</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Criteria</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: char">1.</td>
<td style="vertical-align: top; text-align: left">Complete network configuration</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">2.</td>
<td style="vertical-align: top; text-align: left">Complete traffic</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">3.</td>
<td style="vertical-align: top; text-align: left">Labelled dataset</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">4.</td>
<td style="vertical-align: top; text-align: left">Complete interaction</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">5.</td>
<td style="vertical-align: top; text-align: left">Complete record</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">6.</td>
<td style="vertical-align: top; text-align: left">Available protocols</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">7.</td>
<td style="vertical-align: top; text-align: left">Attack diversity</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">8.</td>
<td style="vertical-align: top; text-align: left">Anonymity</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">9.</td>
<td style="vertical-align: top; text-align: left">Heterogeneity</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char">10.</td>
<td style="vertical-align: top; text-align: left">Feature set</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: char; border-bottom: solid thin">11.</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Metadata</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor457_s_006">
<label>2.3</label>
<title>LITNET-2020 Compliance</title>
<p>The LITNET-2020 dataset was selected for the current study as it complies with most of the above-mentioned requirements, with some reservations regarding the interaction completeness, heterogeneity and feature set completeness criteria.</p>
<p>These eleven criteria as applied to LITNET-2020 are discussed below. 
<list>
<list-item id="j_infor457_li_001">
<label>1.</label>
<p>Complete network configuration: In order to investigate the real course of attacks, it is necessary to test the real network configuration. All of the network flows in this dataset are received or generated at the Network of Lithuanian academic institutions LITNET.</p>
</list-item>
<list-item id="j_infor457_li_002">
<label>2.</label>
<p>Complete traffic: The dataset accumulates full packet flows from the source to the destination, which can be a workstation computer, router or another specialized service device.</p>
</list-item>
<list-item id="j_infor457_li_003">
<label>3.</label>
<p>Labelled dataset: The dataset is labelled into a single benign and 12 malignant classes. The benign class is not separately labelled into sub-classes; however, this could be done, because the number of benign records exceeds 36 million and is close to <inline-formula id="j_infor457_ineq_002"><alternatives><mml:math>
<mml:mn>92</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$92\% $]]></tex-math></alternatives></inline-formula> of the whole dataset.</p>
</list-item>
<list-item id="j_infor457_li_004">
<label>4.</label>
<p>Complete interaction: Correct interpretation of the data requires data from the entire network interoperability process. The LITNET-2020 dataset, however, is a pure network traffic dataset with no correlated host memory or host log information.</p>
</list-item>
<list-item id="j_infor457_li_005">
<label>5.</label>
<p>Record completeness: The LITNET-2020 dataset is compliant with this requirement.</p>
</list-item>
<list-item id="j_infor457_li_006">
<label>6.</label>
<p>Various protocols: Records of 13 types of protocols for normal and 3 types of protocols for malignant traffic are available in the LITNET-2020 dataset.</p>
</list-item>
<list-item id="j_infor457_li_007">
<label>7.</label>
<p>Diversity and novelty of attacks: The dataset includes attack flows detected between 2019-03-06 (first flow) and 2020-01-31 (last flow).</p>
</list-item>
<list-item id="j_infor457_li_008">
<label>8.</label>
<p>Anonymity: It is important that the published set contain only data for which privacy is not a concern. The LITNET-2020 dataset contains no personally identifiable data.</p>
</list-item>
<list-item id="j_infor457_li_009">
<label>9.</label>
<p>Heterogeneity: Data from different sources, such as network streams, operating system logs, network equipment logs, or memory images, must be available. LITNET-2020 is not compliant with this requirement.</p>
</list-item>
<list-item id="j_infor457_li_010">
<label>10.</label>
<p>Feature Set/Attribute Linkage: It is important for the research that data from different types of sources for the same event be linked, for example, device memory view, network traffic, and device logs. LITNET-2020 is not compliant with this requirement as it contains no linked host sources.</p>
</list-item>
<list-item id="j_infor457_li_011">
<label>11.</label>
<p>Metadata and documentation: Information about attributes, how the traffic was generated or collected, network configuration, attackers and victims, machine operating system versions and attack scenarios are required to do the research. LITNET-2020 is documented in Damasevicius <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>).</p>
</list-item>
</list>
</p>
</sec>
<sec id="j_infor457_s_007">
<label>2.4</label>
<title>Cybersecurity Dataset Imbalance Problem</title>
<p>In the datasets selected for the research, the benign class takes from <inline-formula id="j_infor457_ineq_003"><alternatives><mml:math>
<mml:mn>80</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$80\% $]]></tex-math></alternatives></inline-formula> up to <inline-formula id="j_infor457_ineq_004"><alternatives><mml:math>
<mml:mn>92</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$92\% $]]></tex-math></alternatives></inline-formula> of total records (see Table <xref rid="j_infor457_tab_002">2</xref>), while some small classes have less than <inline-formula id="j_infor457_ineq_005"><alternatives><mml:math>
<mml:mn>0.001</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$0.001\% $]]></tex-math></alternatives></inline-formula> (see Table <xref rid="j_infor457_tab_004">4</xref>). The following Table <xref rid="j_infor457_tab_002">2</xref> summarizes the dataset imbalance of benign versus malignant records:</p>
<table-wrap id="j_infor457_tab_002">
<label>Table 2</label>
<caption>
<p>Dataset content split.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record Type</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">80.3%</td>
<td style="vertical-align: top; text-align: left">83.1%</td>
<td style="vertical-align: top; text-align: left">92.0%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Malignant</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">19.7%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">16.9%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8.0%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The following Table <xref rid="j_infor457_tab_003">3</xref> presents the split of malignant classes and summarizes the dataset imbalance shares in accordance with the taxonomy described by He and Ma (<xref ref-type="bibr" rid="j_infor457_ref_025">2013</xref>):</p>
<table-wrap id="j_infor457_tab_003">
<label>Table 3</label>
<caption>
<p>Dataset imbalance.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Imbalance category<sup>1</sup></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Modest &lt;(10 : 1)</td>
<td style="vertical-align: top; text-align: left">8.16%</td>
<td style="vertical-align: top; text-align: left">0.00%</td>
<td style="vertical-align: top; text-align: left">0.00%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">High &lt;(1000 : 1)</td>
<td style="vertical-align: top; text-align: left">11.39%</td>
<td style="vertical-align: top; text-align: left">16.85%</td>
<td style="vertical-align: top; text-align: left">7.83%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Extreme &gt;(1000 : 1)</td>
<td style="vertical-align: top; text-align: left">0.15%</td>
<td style="vertical-align: top; text-align: left">0.08%</td>
<td style="vertical-align: top; text-align: left">0.20%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total Malignant</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">19.7%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">16.9%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8.0%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Share of records in imbalance category.</p>
</table-wrap-foot>
</table-wrap>
<p>The following Table <xref rid="j_infor457_tab_004">4</xref> summarizes the extremely imbalanced (&gt;1000 : 1) classes in the three selected datasets.</p>
<table-wrap id="j_infor457_tab_004">
<label>Table 4</label>
<caption>
<p>Extremely rare classes in the datasets.</p>
</caption>
<table>
<thead>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CSE-CIC-IDS2018</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Share</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Share</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Share</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">0.0695%</td>
<td style="vertical-align: top; text-align: left">DoS-Slowloris</td>
<td style="vertical-align: top; text-align: left">0.0677%</td>
<td style="vertical-align: top; text-align: left">W32.Blaster</td>
<td style="vertical-align: top; text-align: left">0.0660%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-Web</td>
<td style="vertical-align: top; text-align: left">0.0532%</td>
<td style="vertical-align: top; text-align: left">LOIC-UDP<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">0.0107%</td>
<td style="vertical-align: top; text-align: left">ICMP Flood</td>
<td style="vertical-align: top; text-align: left">0.0638%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-XSS</td>
<td style="vertical-align: top; text-align: left">0.0230%</td>
<td style="vertical-align: top; text-align: left">Brute Force-Web</td>
<td style="vertical-align: top; text-align: left">0.0038%</td>
<td style="vertical-align: top; text-align: left">HTTP Flood</td>
<td style="vertical-align: top; text-align: left">0.0630%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infiltration</td>
<td style="vertical-align: top; text-align: left">0.0013%</td>
<td style="vertical-align: top; text-align: left">Brute Force-XSS</td>
<td style="vertical-align: top; text-align: left">0.0014%</td>
<td style="vertical-align: top; text-align: left">Scan</td>
<td style="vertical-align: top; text-align: left">0.0170%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SQL Injection</td>
<td style="vertical-align: top; text-align: left">0.0007%</td>
<td style="vertical-align: top; text-align: left">SQL Injection</td>
<td style="vertical-align: top; text-align: left">0.0005%</td>
<td style="vertical-align: top; text-align: left">Reaper Worm</td>
<td style="vertical-align: top; text-align: left">0.0032%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Heartbleed</td>
<td style="vertical-align: top; text-align: left">0.0004%</td>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">0.0021%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left">Fragmentation</td>
<td style="vertical-align: top; text-align: left">0.0013%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total Extreme &gt;(1 000 : 1)</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.15%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.08%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.20%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>DDOS attack.</p> 
</table-wrap-foot>
</table-wrap>
<p>Various imbalance measures are discussed by Ortigosa-Hernández <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_047">2017</xref>) in a study dedicated to such measures. In Section III.E of Karatas <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>), the authors review the imbalance ratios of several IDS datasets that are most practical to use, including CIC-IDS2017 and CSE-CIC-IDS2018.</p>
<p>Referring to Ortigosa-Hernández <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_047">2017</xref>) and Karatas <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>), the following Formula (<xref rid="j_infor457_eq_001">1</xref>) can be used for the calculation of the imbalance ratio: 
<disp-formula id="j_infor457_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext>Imbalance Ratio</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">ρ</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo movablelimits="false">min</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \text{Imbalance Ratio}=\rho =\frac{\max \{{C_{i}}\}}{\min \{{C_{i}}\}},\]]]></tex-math></alternatives>
</disp-formula> 
where: <inline-formula id="j_infor457_ineq_006"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula> shows the data size in the class <italic>i</italic>.</p>
<p>For example, the historical NSL-KDD dataset has an imbalance ratio of 648, CIC-IDS2017 has an imbalance ratio of 112 287, CSE-CIC-IDS2018 has a somewhat better imbalance ratio of 53 887, and LITNET-2020 has an imbalance ratio of 70 769.</p>
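Formula (<xref rid="j_infor457_eq_001">1</xref>) can be computed directly from per-class record counts. A minimal sketch; the class counts below are hypothetical and serve only to illustrate the calculation:

```python
def imbalance_ratio(class_counts):
    """rho = max{C_i} / min{C_i} over per-class record counts (Formula (1))."""
    sizes = class_counts.values()
    return max(sizes) / min(sizes)

# Toy class counts (hypothetical, for illustration only).
counts = {"benign": 900_000, "dos": 9_000, "heartbleed": 9}
print(imbalance_ratio(counts))  # prints 100000.0
```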
<p>While <italic>imbalance ratios</italic> are an important part of the discussion, <italic>absolute rarity</italic> is another concept, introduced by He and Ma (<xref ref-type="bibr" rid="j_infor457_ref_025">2013</xref>) for the case when there are not enough records to learn a class. If there is not enough information within the feature space, the decision boundary cannot be determined. There are no such classes in the LITNET-2020 dataset, and the data was sufficient for all the machine learning algorithms used in our experiment to learn. However, the <italic>Infiltration</italic>, <italic>Heartbleed</italic> and <italic>Web Attack-SQL Injection</italic> classes in the CIC-IDS2017 dataset exhibit such absolute rarity, and learning the decision boundaries for these classes is complicated and unspecific. In the CSE-CIC-IDS2018 dataset, even though <italic>Infiltration</italic> class records are abundant, a high overlap with the benign class is observed.</p>
</sec>
<sec id="j_infor457_s_008">
<label>2.5</label>
<title>CIC-IDS-2017</title>
<p>The CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) is made available by the Canadian Institute for Cyber Security Research at the University of New Brunswick<xref ref-type="fn" rid="j_infor457_fn_002">2</xref><fn id="j_infor457_fn_002"><label><sup>2</sup></label>
<p>More information at <uri>https://www.unb.ca/cic/datasets/ids-2017.html</uri>.</p></fn> and introduces labelled data of 14 attack types, including <italic>DDoS</italic>, <italic>Brute Force</italic>, <italic>XSS</italic>, <italic>SQL Injection</italic>, <italic>Infiltration</italic>, and <italic>Botnet</italic>. The traffic was emulated in a test environment during the period from July 3 to July 7, 2017. Network traffic features and related aggregates were extracted and generated using the CICFlowMeter tool and made available in the form of 8 CSV files. The CICFlowMeter is an open-source tool<xref ref-type="fn" rid="j_infor457_fn_003">3</xref><fn id="j_infor457_fn_003"><label><sup>3</sup></label>
<p>More information at <uri>https://www.unb.ca/cic/research/applications.html</uri>.</p></fn> provided by CIC at UNB that generates bidirectional flows from pcap files and extracts features from these flows; it was made available to the research community by Draper-Gil <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_013">2016</xref>) and further described by Lashkari <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_035">2017</xref>). The dataset is labelled and contains a total of 2 830 743 records with flow data and synthetic features.</p>
<p>The following Table <xref rid="j_infor457_tab_005">5</xref> is a summary of class representation of this dataset.</p>
<table-wrap id="j_infor457_tab_005">
<label>Table 5</label>
<caption>
<p>Class representation in CIC-IDS2017 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Traffic class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record count</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">BENIGN</td>
<td style="vertical-align: top; text-align: left">2 273 097</td>
<td style="vertical-align: top; text-align: left">80.3004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Hulk</td>
<td style="vertical-align: top; text-align: left">231 073</td>
<td style="vertical-align: top; text-align: left">8.1630%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PortScan</td>
<td style="vertical-align: top; text-align: left">158 930</td>
<td style="vertical-align: top; text-align: left">5.6144%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDoS</td>
<td style="vertical-align: top; text-align: left">128 027</td>
<td style="vertical-align: top; text-align: left">4.5227%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS GoldenEye</td>
<td style="vertical-align: top; text-align: left">10 293</td>
<td style="vertical-align: top; text-align: left">0.3636%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-Patator</td>
<td style="vertical-align: top; text-align: left">7 938</td>
<td style="vertical-align: top; text-align: left">0.2804%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SSH-Patator</td>
<td style="vertical-align: top; text-align: left">5 897</td>
<td style="vertical-align: top; text-align: left">0.2083%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS slowloris</td>
<td style="vertical-align: top; text-align: left">5 796</td>
<td style="vertical-align: top; text-align: left">0.2048%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Slowhttptest</td>
<td style="vertical-align: top; text-align: left">5 499</td>
<td style="vertical-align: top; text-align: left">0.1943%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">1 966</td>
<td style="vertical-align: top; text-align: left">0.0695%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-Brute Force</td>
<td style="vertical-align: top; text-align: left">1 507</td>
<td style="vertical-align: top; text-align: left">0.0532%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-XSS</td>
<td style="vertical-align: top; text-align: left">652</td>
<td style="vertical-align: top; text-align: left">0.0230%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infiltration</td>
<td style="vertical-align: top; text-align: left">36</td>
<td style="vertical-align: top; text-align: left">0.0013%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-SQL Injection</td>
<td style="vertical-align: top; text-align: left">21</td>
<td style="vertical-align: top; text-align: left">0.0007%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Heartbleed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">11</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0004%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The dataset features used further in this research, all measures of duration or related aggregates, belong to the following categories:</p>
<list>
<list-item id="j_infor457_li_012">
<label>•</label>
<p>Fiat (Forward Inter Arrival Time mean, min, max, std): aggregates on the time between two flows sent in the forward direction;</p>
</list-item>
<list-item id="j_infor457_li_013">
<label>•</label>
<p>Biat (Backward Inter Arrival Time mean, min, max, std): aggregates on the time between two flows sent in the backward direction;</p>
</list-item>
<list-item id="j_infor457_li_014">
<label>•</label>
<p>Flowiat (Flow Inter Arrival Time, mean, min, max, std): aggregates on the time between two flows sent in either direction;</p>
</list-item>
<list-item id="j_infor457_li_015">
<label>•</label>
<p>Active (mean, min, max, std): aggregates on the amount of time a flow was active before going idle;</p>
</list-item>
<list-item id="j_infor457_li_016">
<label>•</label>
<p>Idle (mean, min, max, std): aggregates on the amount of time a flow was idle before becoming active;</p>
</list-item>
<list-item id="j_infor457_li_017">
<label>•</label>
<p>Flow Bytes/s: Flow bytes sent per second;</p>
</list-item>
<list-item id="j_infor457_li_018">
<label>•</label>
<p>Flow Packets/s: Flow packets sent per second;</p>
</list-item>
<list-item id="j_infor457_li_019">
<label>•</label>
<p>Duration: The duration of a flow.</p>
</list-item>
</list>
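The feature categories above can be pulled out of the CSV headers by keyword. A hedged sketch; the exact CICFlowMeter column spellings vary between releases, so the header names below are placeholders, not the authoritative feature list:

```python
# Hypothetical CICFlowMeter-style header names; real spellings differ
# between CICFlowMeter releases, so treat these as placeholders.
headers = [
    "Flow Duration", "Flow Bytes/s", "Flow Packets/s",
    "Fwd IAT Mean", "Fwd IAT Min", "Fwd IAT Max", "Fwd IAT Std",
    "Bwd IAT Mean", "Flow IAT Mean", "Active Mean", "Idle Mean",
    "Fwd Packet Length Max", "Label",
]

# Keywords covering the Fiat/Biat/Flowiat, Active, Idle, rate and
# duration categories described in Section 2.5.
keywords = ("IAT", "Active", "Idle", "Duration", "Bytes/s", "Packets/s")
selected = [h for h in headers if any(k in h for k in keywords)]
print(selected)
```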
</sec>
<sec id="j_infor457_s_009">
<label>2.6</label>
<title>CSE-CIC-IDS2018</title>
<p>The CSE-CIC-IDS2018 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) is made available by the Canadian Institute for Cyber Security Research at the University of New Brunswick.<xref ref-type="fn" rid="j_infor457_fn_004">4</xref><fn id="j_infor457_fn_004"><label><sup>4</sup></label>
<p>More information at <uri>https://www.unb.ca/cic/datasets/ids-2018.html</uri>.</p></fn> Data was emulated in the CIC test environment, comprising 50 attacking machines, 420 victim PCs and 30 victim servers, during the period from February 14 to March 2, 2018. The dataset contains records from 14 distinct attacks, is labelled, and is presented together with anonymised PCAP<xref ref-type="fn" rid="j_infor457_fn_005">5</xref><fn id="j_infor457_fn_005"><label><sup>5</sup></label>
<p>File format abbreviated from Packet CAPture, the traffic capture file format used by networking tools.</p></fn> files. 80 network traffic features were extracted and calculated using the CICFlowMeter tool. Ten CSV files containing 16 232 943 records are made available for machine learning. The representation of classes in IDS-2018 ranges from approximately 1 : 20 to 1 : 100 000.</p>
<p>The following Table <xref rid="j_infor457_tab_006">6</xref> presents a summary of class representation of this dataset.</p>
<table-wrap id="j_infor457_tab_006">
<label>Table 6</label>
<caption>
<p>Class representation of CSE-CIC-IDS2018 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Traffic class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record count</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">13 484 708</td>
<td style="vertical-align: top; text-align: left">83.070%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">HOIC<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">686 012</td>
<td style="vertical-align: top; text-align: left">4.226%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LOIC-HTTP<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">576 191</td>
<td style="vertical-align: top; text-align: left">3.550%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Hulk<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">461 912</td>
<td style="vertical-align: top; text-align: left">2.846%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">286 191</td>
<td style="vertical-align: top; text-align: left">1.76%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-BruteForce</td>
<td style="vertical-align: top; text-align: left">193 360</td>
<td style="vertical-align: top; text-align: left">1.191%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SSH-Bruteforce</td>
<td style="vertical-align: top; text-align: left">187 589</td>
<td style="vertical-align: top; text-align: left">1.156%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infilteration</td>
<td style="vertical-align: top; text-align: left">161 934</td>
<td style="vertical-align: top; text-align: left">0.998%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SlowHTTPTest<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">139 890</td>
<td style="vertical-align: top; text-align: left">0.862%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GoldenEye<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">41 508</td>
<td style="vertical-align: top; text-align: left">0.256%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Slowloris<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">10 990</td>
<td style="vertical-align: top; text-align: left">0.068%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LOIC-UDP<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">1 730</td>
<td style="vertical-align: top; text-align: left">0.011%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-Web</td>
<td style="vertical-align: top; text-align: left">611</td>
<td style="vertical-align: top; text-align: left">0.004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-XSS</td>
<td style="vertical-align: top; text-align: left">230</td>
<td style="vertical-align: top; text-align: left">0.001%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">SQL Injection</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">87</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0005%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Variants of DoS attacks.</p>
</table-wrap-foot>
</table-wrap>
<p>The same dataset features as described in Section <xref rid="j_infor457_s_008">2.5</xref> are further used in this research for feature selection.</p>
</sec>
<sec id="j_infor457_s_010">
<label>2.7</label>
<title>LITNET-2020</title>
<p>LITNET-2020 is a new annotated network dataset for network intrusion detection, obtained from real-life traffic of the Lithuanian academic network LITNET by researchers from Kaunas University of Technology (KTU). The data collection environment, a comparison of the dataset with other recently published network-intrusion datasets, and a description of the attacks represented in LITNET-2020 are introduced by Damasevicius <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>). The dataset contains benign traffic of the academic network and 12 attack types generated on the KTU-managed LITNET network from March 6, 2019 to January 31, 2020. Network traffic was captured with the open source nfcapd collector in its binary format, anonymised, and processed into CSV format, containing 39 603 674 time-stamped records. Nfsen, MeSequel, and Python script tools were used for extra feature generation and pre-processing, with data fields in the CSV format named after the fields generated by Nfdump.<xref ref-type="fn" rid="j_infor457_fn_006">6</xref><fn id="j_infor457_fn_006"><label><sup>6</sup></label>
<p>For a definition of features used in Nfdump 1.6 see <uri>https://github.com/phaag/nfdump/blob/master/bin/parse_csv.pl</uri>.</p></fn> The 49 attributes specific to the NetFlow v9 protocol, as defined in RFC 3954 (Claise, <xref ref-type="bibr" rid="j_infor457_ref_010">2004</xref>), form the basis of the dataset, further expanded with additional fields of time and TCP flags (in symbolic format), which can be used to identify attacks. An additional 19 attack-specific attributes are added. The representation of classes in LITNET-2020 is imbalanced in a range from approximately 1 : 30 to 1 : 100 000.</p>
<p>The following Table <xref rid="j_infor457_tab_007">7</xref> presents a summary of class representation of this dataset.</p>
<table-wrap id="j_infor457_tab_007">
<label>Table 7</label>
<caption>
<p>Class representation of LITNET-2020 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Traffic class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record label</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record count<sup>1</sup></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Share, %</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">none</td>
<td style="vertical-align: top; text-align: left">36 423 860</td>
<td style="vertical-align: top; text-align: left">91.9709%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SYN Flood</td>
<td style="vertical-align: top; text-align: left">tcp_syn_f</td>
<td style="vertical-align: top; text-align: left">1 580 016</td>
<td style="vertical-align: top; text-align: left">3.9896%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Code Red</td>
<td style="vertical-align: top; text-align: left">tcp_red_w</td>
<td style="vertical-align: top; text-align: left">1 255 702</td>
<td style="vertical-align: top; text-align: left">3.1707%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Smurf</td>
<td style="vertical-align: top; text-align: left">icmp_smf</td>
<td style="vertical-align: top; text-align: left">118 958</td>
<td style="vertical-align: top; text-align: left">0.3004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">UDP Flood</td>
<td style="vertical-align: top; text-align: left">udp_f</td>
<td style="vertical-align: top; text-align: left">93 583</td>
<td style="vertical-align: top; text-align: left">0.2363%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LAND DoS</td>
<td style="vertical-align: top; text-align: left">tcp_land</td>
<td style="vertical-align: top; text-align: left">52 417</td>
<td style="vertical-align: top; text-align: left">0.1324%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">W32.Blaster</td>
<td style="vertical-align: top; text-align: left">tcp_w32_w</td>
<td style="vertical-align: top; text-align: left">24 291</td>
<td style="vertical-align: top; text-align: left">0.0613%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ICMP Flood</td>
<td style="vertical-align: top; text-align: left">icmp_f</td>
<td style="vertical-align: top; text-align: left">23 256</td>
<td style="vertical-align: top; text-align: left">0.0587%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">HTTP Flood</td>
<td style="vertical-align: top; text-align: left">http_f</td>
<td style="vertical-align: top; text-align: left">22 959</td>
<td style="vertical-align: top; text-align: left">0.0580%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Port Scan</td>
<td style="vertical-align: top; text-align: left">tcp_udp_win_p</td>
<td style="vertical-align: top; text-align: left">6 232</td>
<td style="vertical-align: top; text-align: left">0.0157%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Reaper Worm</td>
<td style="vertical-align: top; text-align: left">udp_reaper_w</td>
<td style="vertical-align: top; text-align: left">1 176</td>
<td style="vertical-align: top; text-align: left">0.0030%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Spam botnet</td>
<td style="vertical-align: top; text-align: left">smtp_b</td>
<td style="vertical-align: top; text-align: left">747</td>
<td style="vertical-align: top; text-align: left">0.0019%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Fragmentation</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">udp_0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">477</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0012%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Record counts before removing timestamp and related record duplicates.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="j_infor457_s_011" sec-type="methods">
<label>3</label>
<title>Methods</title>
<p>Several types of methods were used in this research to improve the performance of the ML methods. They can be grouped into pre-processing methods (see Sections <xref rid="j_infor457_s_012">3.1</xref>–<xref rid="j_infor457_s_014">3.3</xref>) and machine learning methods (see Section <xref rid="j_infor457_s_016">3.5</xref>). Data record under-sampling methods are discussed in detail in Section <xref rid="j_infor457_s_012">3.1</xref>, record over-sampling in Section <xref rid="j_infor457_s_013">3.2</xref>, and the feature selection, scaling and frequency transformation undertaken as pre-processing activities in Section <xref rid="j_infor457_s_014">3.3</xref>. Machine learning methods capable of cost-sensitive learning (see Section <xref rid="j_infor457_s_016">3.5</xref>) were chosen for performance comparison in this paper.</p>
<p>For all models, hyper-parameters were searched using the <italic>GridSearch</italic> method, and multiple performance measures (see Section <xref rid="j_infor457_s_024">3.6</xref>) were later used to evaluate and compare the ML algorithms.</p>
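The grid search step can be sketched with scikit-learn's GridSearchCV. The estimator, parameter grid and synthetic data below are illustrative assumptions, not the models or grids actually searched in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced stand-in for the flow-feature matrix.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)

# Hypothetical grid; the paper does not list the exact values searched.
grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    grid, cv=3, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_)
```

Scoring by balanced accuracy rather than plain accuracy keeps the majority class from dominating model selection on imbalanced data.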
<sec id="j_infor457_s_012">
<label>3.1</label>
<title>Under-Sampling Methods</title>
<p>The benign class in our datasets constitutes up to <inline-formula id="j_infor457_ineq_007"><alternatives><mml:math>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$90\% $]]></tex-math></alternatives></inline-formula> of the total records. Under-sampling refers to the process of reducing the number of samples in a dataset. For all datasets, fixed ratio random under-sampling of benign and over-represented malignant class records, using a uniform distribution for record selection, was implemented on data load. The fixed ratio random under-sampling method aims to balance the class distribution through random-uniform elimination of majority class examples. It is worth noting that random under-sampling can discard potentially useful data that could be important for the machine learning process. Under-sampling methods can be categorized into two groups: (i) fixed ratio under-sampling and (ii) cleaning under-sampling (Lemaitre <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_039">2016</xref>). Fixed ratio under-sampling is based on a statistically random selection that targets either an absolute number of records of a given class or a ratio constituting a proportion of the total number of labels. Cleaning under-sampling is based on either (i) clustering, (ii) nearest neighbour analysis, or (iii) classification accuracy (based on the instance hardness threshold, Smith <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_061">2014</xref>).</p>
<p>Cleaning under-sampling approaches do not target a specific ratio, but rather clean the feature space based on some empirical criteria (Lemaitre <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_039">2016</xref>). According to Lemaitre <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_039">2016</xref>), these criteria are derived from the nearest neighbour rule, namely: (i) condensed nearest neighbours (Hart, <xref ref-type="bibr" rid="j_infor457_ref_024">1968</xref>), (ii) edited nearest neighbours (Wilson, <xref ref-type="bibr" rid="j_infor457_ref_069">1972</xref>), (iii) one-sided selection (Kubat and Matwin, <xref ref-type="bibr" rid="j_infor457_ref_033">1997</xref>), (iv) neighbourhood cleaning rule (Laurikkala, <xref ref-type="bibr" rid="j_infor457_ref_036">2001</xref>), and (v) Tomek links (Tomek, <xref ref-type="bibr" rid="j_infor457_ref_067">1976</xref>).</p>
<p>Cleaning under-sampling methods such as <italic>Edited Nearest Neighbours</italic>, <italic>TomekLinks</italic> and <italic>Condensed Nearest Neighbours</italic> were tested; however, due to the size of the sub-sampled data and the large computational overhead they require, these methods were not explored further. The fixed random under-sampling was implemented in two steps as follows: 
<list>
<list-item id="j_infor457_li_020">
<label>1.</label>
<p>Major class records were first randomly under-sampled to a target number of records, so as to provide sufficient learning for all models. Target numbers were obtained after analysis of learning curves. Sufficient learning is defined here as the objective of having the learning and testing curves converge within a margin of less than <inline-formula id="j_infor457_ineq_008"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$1\% $]]></tex-math></alternatives></inline-formula>, which for all models in this experiment occurs after approximately 0.6 million records.</p>
</list-item>
<list-item id="j_infor457_li_021">
<label>2.</label>
<p>Numbers of benign and other highly imbalanced classes were further transformed with a <italic>random under-sampling</italic> function from the <italic>Imbalanced-learn</italic> library (Lemaitre <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_039">2016</xref>) using per-class record number targets, calculated with the following empirically chosen skewed ratio function introduced in this research: <inline-formula id="j_infor457_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>∗</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msqrt>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N\ast (1-\sqrt{s}/2)$]]></tex-math></alternatives></inline-formula>, where <italic>N</italic> is the number of initial records within a named class and <italic>s</italic> is the share of records in that class. Further on in this paper, this proposed under-sampling method is referred to as <italic>Skewed fixed ratio under-sampling</italic>. The effect of this function is that the numbers of over-represented classes are decreased in a non-linear manner, penalizing the best represented classes while leaving the rare classes almost intact, thus simplifying and speeding up the learning of rare classes and decreasing the imbalance.</p>
</list-item>
</list>
</p>
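<p>The per-class targets of the proposed <italic>Skewed fixed ratio under-sampling</italic> can be sketched as follows (a minimal illustration with hypothetical class counts, where a class of size <italic>N</italic> and share <italic>s</italic> is reduced to <italic>N</italic>(1 − √<italic>s</italic>/2); the resulting dictionary has the shape accepted by the <italic>sampling_strategy</italic> parameter of the <italic>Imbalanced-learn</italic> random under-sampler):</p>

```python
import math

def skewed_fixed_ratio_targets(class_counts):
    """Per-class under-sampling targets N * (1 - sqrt(s) / 2),
    where N is the class size and s its share of all records."""
    total = sum(class_counts.values())
    return {label: int(n * (1 - math.sqrt(n / total) / 2))
            for label, n in class_counts.items()}

# Hypothetical counts: the dominant class is reduced strongly,
# the rarer class is reduced much less.
counts = {"Benign": 750, "PortScan": 250}
targets = skewed_fixed_ratio_targets(counts)
```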
</sec>
<sec id="j_infor457_s_013">
<label>3.2</label>
<title>Over-Sampling Methods</title>
<p>In this paper, to balance minority classes, we investigate random and SMOTE (<italic>Synthetic Minority Over-sampling Technique</italic>) (Chawla <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_007">2002</xref>) over-sampling methods. Random over-sampling is a baseline method that aims to balance the class distribution through random replication of minority class examples. Unfortunately, this can increase the likelihood of classifier overfitting (Batista <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_002">2004</xref>). Therefore, we removed all duplicates from the training data.</p>
<p>A more advanced method, capable of increasing minority class size without duplication, is SMOTE. SMOTE forms new minority class examples by linearly interpolating between minority class examples that lie close together. This mitigates the risk of overfitting, as the decision boundaries of the classifier for the minority class are moved further away from the minority class space. SMOTE works in feature space, not in data space; therefore, before over-sampling is executed, the first step is to select the numeric features to over-sample, as it is not necessary to over-sample in all dimensions. SMOTE over-sampling is achieved by the following steps: a) take the <italic>k</italic> nearest neighbours from the minority class for some minority class vector in the feature space, b) randomly choose a vector from those <italic>k</italic> neighbours, c) take the difference between the vector and its neighbour, and multiply the difference vector by a random number between 0 and 1, d) repeat the previous steps until the target number of synthetic points is reached. After this, the new records can be added to the current data (see Chawla <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_007">2002</xref>, for the complete algorithm). The SMOTE method can be combined with under-sampling methods to remove examples of all classes that tend to be misclassified. For example, in SMOTE with the <italic>Edited Nearest Neighbours</italic> (<italic>ENN</italic>) algorithm (Batista <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_002">2004</xref>), after SMOTE is used to over-sample a number of records in defined minority classes, <italic>ENN</italic> is used to remove samples from both classes, such that any sample that is misclassified by its given number of nearest neighbours is removed from the training set. 
Batista <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_002">2004</xref>) demonstrated the best results on imbalanced datasets with minority classes containing under 100 records. However, due to the complexity of the edited nearest neighbours procedure (Witten <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_071">2005</xref>) being <inline-formula id="j_infor457_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathcal{O}(nkd)$]]></tex-math></alternatives></inline-formula>, where <italic>n</italic> is the number of samples, <italic>d</italic> – the number of dimensions (features) and <italic>k</italic> – the number of nearest neighbours, this solution is resource intensive.</p>
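<p>Steps a)–d) above can be sketched in plain Python (a simplified illustration with hypothetical two-dimensional minority points, not the library implementation):</p>

```python
import math
import random

def smote_points(minority, k=2, n_new=3, seed=0):
    """Generate synthetic minority points following SMOTE steps a)-d)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # a) find the k nearest minority-class neighbours of x
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)   # b) pick one neighbour at random
        gap = rng.random()            # c) random multiplier in [0, 1)
        # d) interpolate between x and the chosen neighbour
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

pts = smote_points([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

Each synthetic point lies on a segment between two existing minority points, so no record is duplicated.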
<p>As our datasets have not only continuous but also nominal features, in this research we used a modification of SMOTE – <italic>Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC)</italic> – from the <italic>imbalanced-learn</italic> library (Lemaître <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_040">2017</xref>). We used the recommended number of neighbours, <inline-formula id="j_infor457_ineq_011"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5</mml:mn></mml:math><tex-math><![CDATA[$k=5$]]></tex-math></alternatives></inline-formula>, and separated categorical and numeric features before over-sampling.</p>
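<p>The nominal-feature handling of SMOTE-NC can be illustrated as follows (a minimal sketch with hypothetical protocol values): a synthetic sample receives the nominal value occurring most frequently among the <italic>k</italic> nearest neighbours, while continuous features are interpolated as in plain SMOTE.</p>

```python
from collections import Counter

def nominal_value_for_synthetic(neighbour_categories):
    """SMOTE-NC assigns a synthetic sample's nominal feature the value
    occurring most frequently among its k nearest minority neighbours."""
    return Counter(neighbour_categories).most_common(1)[0][0]

# Hypothetical protocol values of the k = 5 nearest minority neighbours.
proto = nominal_value_for_synthetic(["tcp", "tcp", "udp", "tcp", "icmp"])
```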
</sec>
<sec id="j_infor457_s_014">
<label>3.3</label>
<title>Feature Selection Methods</title>
<p>Based on the research ideas and practical implementation recommendations made by Sharafaldin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) and Shetye (<xref ref-type="bibr" rid="j_infor457_ref_059">2019</xref>), feature selection was tested with 3 classes of methods: (a) filtering – correlation and related heat map analysis, (b) univariate – recursive feature elimination, and (c) iterative – regularization methods. In this research, features were selected with <italic>SelectKBest</italic> from the <italic>Scikit-learn</italic> library (Pedregosa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_048">2011</xref>). The SelectKBest method takes as a parameter a score function, such as <inline-formula id="j_infor457_ineq_012"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">χ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\chi ^{2}}$]]></tex-math></alternatives></inline-formula>, the Anova <italic>F-value</italic> or an information gain function, and retains the <italic>k</italic> features with the highest scores.</p>
<p>If the Anova <italic>F-value</italic> function is used, a test result is considered statistically significant if it is unlikely to have occurred by chance, assuming the truth of the null hypothesis. If <inline-formula id="j_infor457_ineq_013"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">χ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\chi ^{2}}$]]></tex-math></alternatives></inline-formula> is used as a score function, <italic>SelectKBest</italic> will compute the <inline-formula id="j_infor457_ineq_014"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">χ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\chi ^{2}}$]]></tex-math></alternatives></inline-formula> statistic between each feature of <italic>X</italic> and <italic>y</italic> (assumed to be class labels). A small value means the feature is independent of <italic>y</italic>; a large value means the feature is non-randomly related to <italic>y</italic>, and so likely to provide important information. Only <italic>k</italic> features will be retained. Mutual information (information gain) between two random variables is a non-negative value which measures the dependency between the variables. It equals zero if and only if the two random variables are independent, and higher values mean higher dependency. Mutual information methods can capture any kind of statistical dependency but, being non-parametric (Ross, <xref ref-type="bibr" rid="j_infor457_ref_055">2014</xref>), they require more samples for accurate estimation and are computationally more expensive; therefore, owing to its better time performance, the Anova <italic>F-value</italic> was selected in this research.</p>
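<p>The Anova <italic>F</italic>-value scoring can be sketched for a single feature split by class (a minimal one-way ANOVA computation; in Scikit-learn this score function is provided per feature by <italic>f_classif</italic>):</p>

```python
def anova_f(groups):
    """One-way ANOVA F-value for one feature split by class:
    ratio of between-group to within-group mean squares."""
    k = len(groups)                               # number of classes
    n = sum(len(g) for g in groups)               # total samples
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n - k))

# Feature values for two classes; well-separated groups give a large F.
f = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```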
<p>Embedded methods penalize features based on a coefficient threshold: on each iteration of the model training process, the features that contribute most to the training for that iteration are selected.</p>
<p>Further on in this paper, two methods – filtering and <italic>SelectKBest</italic> from <italic>Scikit-Learn</italic> – were used to select features.</p>
<p>When performing feature selection, <italic>SelectKBest</italic> focuses on the largest classes; a possible improvement would therefore be to perform feature selection in a pipeline, by first selecting the most important features for the rarest class and then adding the features needed for every class.</p>
<p>Generating additional synthetic features was not attempted in this research, as all chosen datasets already contain a significant number of such features.</p>
</sec>
<sec id="j_infor457_s_015">
<label>3.4</label>
<title>Cost-Sensitive Learning Methods</title>
<p>Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model (Brownlee, <xref ref-type="bibr" rid="j_infor457_ref_005">2020</xref>).</p>
<p>If not configured otherwise, machine learning algorithms assume that all misclassification errors made by a model are equal. In the case of an intrusion detection problem, missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.</p>
<p>The simplest and most popular approach to implementing cost-sensitive learning is to adjust class weights so that the model is penalized more for training errors made on examples from the minority class. The decision tree algorithm can be modified to weight model error by class weight when selecting splits. The heuristic rule, also confirmed by intuition from decision trees (Brownlee, <xref ref-type="bibr" rid="j_infor457_ref_005">2020</xref>), is to invert the ratio of the class distribution in the training dataset.</p>
<p>In this research, weight adjustment for decision trees was implemented using the <italic>Scikit-learn</italic> model parameter <italic>class_weight</italic>, setting it to <italic>‘balanced’</italic>, which performs the above-mentioned inversion of class weights. Class prior statistics were used for the <italic>Quadratic discriminant analysis</italic> model.</p>
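<p>The inversion performed by <italic>class_weight=‘balanced’</italic> follows the formula <italic>n_samples</italic>/(<italic>n_classes</italic> · <italic>n_c</italic>), which can be sketched as follows (hypothetical class counts):</p>

```python
def balanced_class_weights(class_counts):
    """Weights as produced by class_weight='balanced':
    n_samples / (n_classes * n_c), i.e. the inverted class distribution."""
    n = sum(class_counts.values())
    k = len(class_counts)
    return {c: n / (k * n_c) for c, n_c in class_counts.items()}

# Hypothetical two-class intrusion data: the rare attack class
# receives a proportionally larger weight.
w = balanced_class_weights({"Benign": 90, "Attack": 10})
```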
</sec>
<sec id="j_infor457_s_016">
<label>3.5</label>
<title>Choice of Machine Learning Methods</title>
<p>For a performance comparison of machine learning methods on network intrusion detection data with imbalanced classes, we selected the most popular machine learning algorithms from surveys and review papers, related to intrusion detection (Buczak and Guven, <xref ref-type="bibr" rid="j_infor457_ref_006">2016</xref>; Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>; Damasevicius <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_011">2020</xref>).</p>
<sec id="j_infor457_s_017">
<label>3.5.1</label>
<title>Adaptive Boosting (Adaboost)</title>
<p>The <italic>AdaBoost</italic> ensemble method was proposed by Yoav Freund and Robert Schapire for generating a strong classifier from a set of weak classifiers (Freund and Schapire, <xref ref-type="bibr" rid="j_infor457_ref_018">1997</xref>). The <italic>AdaBoost</italic> algorithm works by weighting instances in the dataset by how easy or difficult they are to classify, and correspondingly prioritizes them in the construction of subsequent models. A default base classifier was used with <italic>Adaboost</italic> by the authors of the CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>), obtaining <italic>Precision</italic> and <inline-formula id="j_infor457_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.77, and <italic>Recall</italic> of 0.84. Yulianto <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_072">2019</xref>) used SMOTE, <italic>Principal Component Analysis (PCA)</italic>, and <italic>Ensemble Feature Selection (EFS)</italic> to improve the performance of <italic>AdaBoost</italic> on the CIC-IDS-2017 dataset, achieving <italic>Accuracy, Precision, Recall</italic>, and <inline-formula id="j_infor457_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> scores of 0.818, 0.818, 1.000, and 0.900, respectively.</p>
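<p>One boosting round of the instance re-weighting described above can be sketched as follows (the multi-class SAMME weight update; the weights and error pattern are hypothetical):</p>

```python
import math

def samme_update(weights, miss, n_classes):
    """One AdaBoost (SAMME) round: compute the learner weight alpha from
    the weighted error, then boost the weights of misclassified samples."""
    err = sum(w for w, m in zip(weights, miss) if m) / sum(weights)
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    new = [w * (math.exp(alpha) if m else 1.0) for w, m in zip(weights, miss)]
    total = sum(new)                     # renormalize to a distribution
    return alpha, [w / total for w in new]

# Hypothetical round: 1 of 4 equally weighted samples is misclassified,
# so its weight grows and it is prioritized by the next weak learner.
alpha, new_w = samme_update([0.25] * 4, [True, False, False, False], n_classes=2)
```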
</sec>
<sec id="j_infor457_s_018">
<label>3.5.2</label>
<title>Classification and Regression Tree (CART)</title>
<p>The <italic>Classification and Regression Tree</italic> method was proposed by Breiman <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>) and is used to construct tree-structured rules from training data. Tree split points are chosen on the basis of cost function minimization.</p>
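<p>The cost function minimization can be illustrated with the Gini impurity, one of the standard CART split criteria (a minimal sketch with hypothetical labels):</p>

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_cost(left, right):
    """CART split cost: impurity of the children weighted by their sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A pure split has zero cost; a fully mixed split keeps the impurity high.
pure = split_cost(["benign", "benign"], ["attack", "attack"])
mixed = split_cost(["benign", "attack"], ["benign", "attack"])
```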
<p>The authors of the CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) obtained weighted averages of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.98 using ID3 (<italic>Iterative Dichotomiser 3</italic>), introduced by Quinlan (<xref ref-type="bibr" rid="j_infor457_ref_049">1986</xref>).</p>
<p>In this research, CART, as implemented in the <italic>Scikit-learn</italic> library, was also used to obtain a base classifier and tree parameters for <italic>Adaboost</italic>, <italic>Gradient Boosting Classifier</italic> and <italic>Random Forest Classifier</italic>. Tree depth and alpha were obtained using the method of maximum cost path analysis (Breiman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>), implemented in the <italic>Scikit-learn</italic> library <italic>cost_complexity_pruning_path</italic> function, discussed in Section <xref rid="j_infor457_s_027">3.8</xref>.</p>
</sec>
<sec id="j_infor457_s_019">
<label>3.5.3</label>
<title>k-Nearest Neighbours (KNN)</title>
<p>The <italic>k-Nearest Neighbours</italic> method was proposed by Dudani (<xref ref-type="bibr" rid="j_infor457_ref_014">1976</xref>), as a method which makes use of a neighbour weighting function for the purpose of assigning a class to an unclassified sample. KNN was used by authors of the CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) with obtained results for weighted averages of <italic>Precision, Recall</italic> and <inline-formula id="j_infor457_ineq_018"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.96. The <italic>KNN</italic> implementation in Scikit-learn uses the Euclidean distance as its default metric. However, this is not appropriate when the domain presents qualitative attributes or categorical features. For those domains, the distance for qualitative attributes is usually calculated using the overlap function, which assigns the value 0 (if two examples have the same value for a given attribute) or the value 1 (if these values differ). In this research we used the <italic>Manhattan</italic> distance, with a positive effect observed in the experiments.</p>
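<p>A combined metric of this kind can be sketched as follows (Manhattan distance over numeric features plus the overlap function over categorical ones; the flow-record values are hypothetical):</p>

```python
def heterogeneous_distance(a, b, categorical):
    """Manhattan distance over numeric features combined with the overlap
    function (0 if equal, 1 otherwise) over categorical features."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in categorical:
            total += 0.0 if x == y else 1.0   # overlap function
        else:
            total += abs(x - y)               # Manhattan term
    return total

# Hypothetical flow records: (duration, packets, protocol);
# index 2 is categorical.
d = heterogeneous_distance((1.0, 3.0, "tcp"), (2.5, 1.0, "udp"), {2})
```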
</sec>
<sec id="j_infor457_s_020">
<label>3.5.4</label>
<title>Quadratic Discriminant Analysis (QDA)</title>
<p><italic>Quadratic discriminant analysis</italic> descends from the discriminant analysis introduced by Fisher (<xref ref-type="bibr" rid="j_infor457_ref_017">1954</xref>). Bayesian estimation for QDA was first proposed by Geisser (<xref ref-type="bibr" rid="j_infor457_ref_022">1964</xref>). <italic>Quadratic discriminant analysis</italic> (QDA) models the likelihood of each class as a Gaussian distribution, then uses the posterior distributions to estimate the class for a given test point (Friedman, <xref ref-type="bibr" rid="j_infor457_ref_019">2001</xref>). The method is sensitive to the knowledge of priors. QDA was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>), obtaining <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.97, 0.88 and 0.92.</p>
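<p>The class estimate from Gaussian likelihoods and priors can be sketched in one dimension (hypothetical priors, means and variances; the method's sensitivity to priors is visible in the log-prior term):</p>

```python
import math

def qda_log_posterior(x, prior, mean, var):
    """Unnormalized log posterior for one class under a Gaussian
    likelihood: log prior + log N(x; mean, var)."""
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))

def qda_predict(x, classes):
    """Pick the class with the largest posterior (1-D illustration;
    in general QDA allows a separate covariance per class)."""
    return max(classes, key=lambda c: qda_log_posterior(x, *classes[c]))

# Hypothetical classes: (prior, mean, variance) per class.
classes = {"benign": (0.9, 0.0, 1.0), "attack": (0.1, 5.0, 1.0)}
label = qda_predict(4.5, classes)
```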
</sec>
<sec id="j_infor457_s_021">
<label>3.5.5</label>
<title>Random Forest Trees (RFT)</title>
<p>The <italic>Random Forest Trees (RFT)</italic> classifier was proposed by Breiman (<xref ref-type="bibr" rid="j_infor457_ref_003">2001</xref>) as a combination of tree predictors minimizing the overall generalization error of the participating trees as the number of trees in the forest becomes larger. Random forests are an alternative to <italic>Adaboost</italic> by Freund and Schapire (<xref ref-type="bibr" rid="j_infor457_ref_018">1997</xref>) and are more robust with respect to noise. Random forests are an extension of bagged decision trees in which only a random subset of features is considered for each split.</p>
<p>The algorithm was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>), and also by Kurniabudi <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_034">2020</xref>). Sharafaldin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) obtained results for the weighted averages of <italic>Precision, Recall</italic> and <inline-formula id="j_infor457_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.98, 0.97, and 0.97. In a study by Kurniabudi <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_034">2020</xref>), the <italic>Random Forest</italic> algorithm achieved <italic>Accuracy, Precision and Recall</italic> of 0.998, using the 15–22 selected features. These metrics were estimated for the benign and attack classes.</p>
</sec>
<sec id="j_infor457_s_022">
<label>3.5.6</label>
<title>Gradient Boosting Classifier (GBC)</title>
<p>In order to extend the scope of the research, the <italic>Gradient Boosting Classifier (GBC)</italic>, as proposed by Friedman (<xref ref-type="bibr" rid="j_infor457_ref_019">2001</xref>) and Friedman (<xref ref-type="bibr" rid="j_infor457_ref_020">2002</xref>), was added as a natural member of classifier ensemble methods. GBC is a stochastic gradient boosting algorithm in which decision trees are fitted on the negative gradient of the chosen loss function. The idea of gradient boosting is to fit the base learner not to re-weighted observations, as in <italic>AdaBoost</italic>, but to the negative gradient vector of the loss function evaluated at the previous iteration. The <italic>XGBoost</italic> library (Chen and Guestrin, <xref ref-type="bibr" rid="j_infor457_ref_008">2016</xref>), a GPU-supported implementation of GBC, was used in this research. GBC results from other authors are not publicly known.</p>
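<p>The residual-fitting idea can be sketched for the squared loss, where the negative gradient equals the residual (a one-dimensional illustration with depth-1 trees, not the XGBoost implementation):</p>

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (stump) to the current residuals by
    scanning thresholds and predicting the mean on each side."""
    best = None
    for t in x:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def gradient_boost(x, y, rounds=10, lr=0.5):
    """Squared-error gradient boosting: each stump is fitted to the
    negative gradient (the residuals) of the previous iteration."""
    pred = [sum(y) / len(y)] * len(y)          # initial constant model
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x, y = [0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0]
pred = gradient_boost(x, y)
```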
</sec>
<sec id="j_infor457_s_023">
<label>3.5.7</label>
<title>Multiple Layer Perceptron</title>
<p>The <italic>Multiple Layer Perceptron (MLP)</italic> was proposed by Rosenblatt (<xref ref-type="bibr" rid="j_infor457_ref_054">1962</xref>) as an extension of the linear perceptron model (Rosenblatt, <xref ref-type="bibr" rid="j_infor457_ref_053">1957</xref>). It is a supervised learning artificial neural network, trained with back-propagation, that can have multiple layers and a chosen, not necessarily linear, activation function.</p>
<p>MLP was used in the study of Sharafaldin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) with obtained results for weighted averages of <italic>Precision, Recall</italic> and <inline-formula id="j_infor457_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> of 0.77, 0.83, and 0.76.</p>
</sec>
</sec>
<sec id="j_infor457_s_024">
<label>3.6</label>
<title>Performance Measures</title>
<p>Standard performance metrics for classifiers are presented in Section <xref rid="j_infor457_s_025">3.6.1</xref>, and the <italic>Bias and Variance decomposition</italic> metric (see Section <xref rid="j_infor457_s_026">3.7</xref>) was used to evaluate the ML algorithms’ tendencies to overfit or underfit.</p>
<sec id="j_infor457_s_025">
<label>3.6.1</label>
<title>Confusion Matrix Based Metrics</title>
<p><italic>Accuracy</italic>, <italic>Precision</italic> in equation (<xref rid="j_infor457_eq_005">5</xref>), <italic>Recall</italic> in equation (<xref rid="j_infor457_eq_003">3</xref>) and <inline-formula id="j_infor457_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> in equation (<xref rid="j_infor457_eq_006">6</xref>), are very sensitive to the representation of classes in the source datasets (Sokolova and Lapalme, <xref ref-type="bibr" rid="j_infor457_ref_062">2009</xref>). Results change if proportions of class samples change (Tharwat, <xref ref-type="bibr" rid="j_infor457_ref_064">2018</xref>). In their study Garcia <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_021">2010</xref>) review most of the performance measures used for imbalanced classes, introducing a new measure called <italic>Index of Balanced Accuracy</italic> (<italic>IBA</italic>) currently implemented and used in the classification report of <italic>Imbalanced-learn</italic> library (Lemaitre <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_039">2016</xref>) for calculating <italic>Geometric mean</italic> of recall <inline-formula id="j_infor457_ineq_023"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, equation (<xref rid="j_infor457_eq_004">4</xref>) introduced by Kubat and Matwin (<xref ref-type="bibr" rid="j_infor457_ref_033">1997</xref>). An experimental comparison of performance measures for classification is presented by Ferri <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_016">2009</xref>). Mosley (<xref ref-type="bibr" rid="j_infor457_ref_046">2013</xref>) reviews multi-class data performance metrics such as <italic>Recall</italic>, <inline-formula id="j_infor457_ineq_024"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, <italic>Relative Classifier Information</italic> (<italic>RCI</italic>) (Wei <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_068">2010</xref>), <italic>Matthew’s Correlation Coefficient</italic> (<italic>MCC</italic>) (Matthews, <xref ref-type="bibr" rid="j_infor457_ref_045">1975</xref>), and <italic>Confusion Entropy</italic> (<italic>CEN</italic>) (Jurman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_027">2012</xref>). It is important to note that Chicco and Jurman (<xref ref-type="bibr" rid="j_infor457_ref_009">2020</xref>) demonstrated that <italic>MCC</italic> and <italic>CEN</italic> cannot be reliably used in the case of imbalanced data classes, so these will not be discussed in this paper. Mosley (<xref ref-type="bibr" rid="j_infor457_ref_046">2013</xref>) introduces a <italic>per-class Balanced Accuracy</italic> (also known as the <italic>Balanced accuracy score (BAS)</italic>), see equation (<xref rid="j_infor457_eq_002">2</xref>), which is based on recall and neglects precision. However, <italic>Precision</italic> is very sensitive to attributions of records from other classes, which was clearly observed during this research; in the case of imbalance, it mainly indicates false classification of major classes, and it has therefore also been chosen for study in this research.</p>
<p>Further on in this research, the <italic>Balanced accuracy score</italic> and <inline-formula id="j_infor457_ineq_025"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula> along with <italic>Precision</italic> were chosen as classification quality quantification metrics for comparison because: (i) these metrics were previously used by other researchers to measure performance of learning in imbalanced multi-class problems, while the datasets used in this study have extremely imbalanced class distributions, (ii) these measures are available in popular and open source software libraries like <italic>Scikit-learn</italic> and <italic>Imbalanced-learn</italic>, (iii) these metrics have simple and clear intuition for use in practical cyber-security applications, (iv) precision also allows for comparison with other research. <italic>Macro</italic> score averages were calculated in the further experiments to give equal weight to each class, avoiding scaling with respect to the number of instances per class.</p>
<p>The Balanced accuracy score <italic>BAS</italic> in formula (<xref rid="j_infor457_eq_002">2</xref>) is defined as the average of the recall values over <italic>K</italic> classes: 
<disp-formula id="j_infor457_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">BAS</mml:mtext>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Recall</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{BAS}=\frac{1}{K}{\sum \limits_{i=1}^{K}}{\textit{Recall}_{i}},\]]]></tex-math></alternatives>
</disp-formula> 
where: 
<disp-formula id="j_infor457_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Recall</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">TP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">TP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">FN</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\textit{Recall}_{i}}=\frac{{\textit{TP}_{i}}}{{\textit{TP}_{i}}+{\textit{FN}_{i}}}=\frac{{c_{ii}}}{{\textstyle\textstyle\sum _{j=1}^{k}}{c_{ij}}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_infor457_ineq_026"><alternatives><mml:math>
<mml:mtext mathvariant="italic">TP</mml:mtext></mml:math><tex-math><![CDATA[$\textit{TP}$]]></tex-math></alternatives></inline-formula> stands for True Positive, and <inline-formula id="j_infor457_ineq_027"><alternatives><mml:math>
<mml:mtext mathvariant="italic">FN</mml:mtext></mml:math><tex-math><![CDATA[$\textit{FN}$]]></tex-math></alternatives></inline-formula> stands for False Negative, <italic>i</italic> is the index of the class in question and <italic>k</italic> is the number of classes in the dataset. <inline-formula id="j_infor457_ineq_028"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">TP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{TP}_{i}}$]]></tex-math></alternatives></inline-formula> is the number of true positives (correctly classified instances) for class <italic>i</italic>, and <inline-formula id="j_infor457_ineq_029"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">FN</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\textit{FN}_{i}}$]]></tex-math></alternatives></inline-formula> is the number of false negative instances for class <italic>i</italic>. <inline-formula id="j_infor457_ineq_030"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{ij}}$]]></tex-math></alternatives></inline-formula> is an element of the confusion matrix in row <italic>i</italic> and column <italic>j</italic>.</p>
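<p>As an illustration, per-class recall can be computed directly from such a confusion matrix. The following minimal Python sketch (the two-class matrix is hypothetical) divides each diagonal element by its row sum, exactly as in the formula above:</p>

```python
def recall_per_class(cm):
    # cm[i][j]: count of instances of true class i predicted as class j.
    # Recall_i = c_ii / sum_j c_ij, i.e. the diagonal over the row sum.
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]

# Hypothetical confusion matrix: rows are true classes, columns predictions.
cm = [[50, 10],
      [5, 35]]
print(recall_per_class(cm))  # [0.8333..., 0.875]
```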
<p><italic>Geometric mean</italic> <inline-formula id="j_infor457_ineq_031"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula> of sensitivity is defined as follows: 
<disp-formula id="j_infor457_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
<mml:mo>=</mml:mo><mml:mroot>
<mml:mrow>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∏</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">Recall</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:mroot>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \bar{G}=\sqrt[k]{{\prod \limits_{i=1}^{k}}{\mathrm{Recall}_{i}}},\]]]></tex-math></alternatives>
</disp-formula> 
where <italic>k</italic> is the number of classes in the dataset.</p>
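<p>Equation (4) can be sketched in a few lines of Python. Note that a single class with zero recall drives the whole score to zero, which is what makes this metric sensitive to rare classes (the recall values below are hypothetical):</p>

```python
import math

def g_mean(recalls):
    # Equation (4): the k-th root of the product of per-class recalls.
    # One class with zero recall forces the geometric mean to zero.
    return math.prod(recalls) ** (1.0 / len(recalls))

print(g_mean([0.95, 0.90, 0.40]))  # dominated by the worst class
print(g_mean([0.95, 0.90, 0.0]))   # 0.0
```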
<p>Precision for class <italic>i</italic> is defined as follows: 
<disp-formula id="j_infor457_eq_005">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Precision</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">TP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">TP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">FP</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\textit{Precision}_{i}}=\frac{{\textit{TP}_{i}}}{{\textit{TP}_{i}}+{\textit{FP}_{i}}}=\frac{{c_{ii}}}{{\textstyle\textstyle\sum _{j=1}^{k}}{c_{ji}}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The <inline-formula id="j_infor457_ineq_032"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> score for class <italic>i</italic> is defined as follows: 
<disp-formula id="j_infor457_eq_006">
<label>(6)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Precision</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>+</mml:mo><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">Recall</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {{F_{1}}_{i}}=\frac{2}{\frac{1}{{\textit{Precision}_{i}}}+\frac{1}{{\textit{Recall}_{i}}}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
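<p>Per-class precision (equation (5), a column sum of the confusion matrix), the per-class F1 score (equation (6)) and the macro average can be sketched analogously; the confusion matrix used below is hypothetical:</p>

```python
def precision_per_class(cm):
    # Precision_i = c_ii / sum_j c_ji (column sums), equation (5).
    k = len(cm)
    out = []
    for i in range(k):
        col = sum(cm[j][i] for j in range(k))
        out.append(cm[i][i] / col if col else 0.0)
    return out

def f1_per_class(cm):
    # Equation (6): harmonic mean of per-class precision and recall.
    k = len(cm)
    scores = []
    for i in range(k):
        row = sum(cm[i])
        col = sum(cm[j][i] for j in range(k))
        p = cm[i][i] / col if col else 0.0
        r = cm[i][i] / row if row else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return scores

def macro(values):
    # Macro average: unweighted mean over classes, so rare classes
    # contribute exactly as much as frequent ones.
    return sum(values) / len(values)

cm = [[50, 10], [5, 35]]          # hypothetical two-class matrix
print(precision_per_class(cm))    # [50/55, 35/45]
print(macro(f1_per_class(cm)))
```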
<p>In this research, we have used macro-averaged (i.e. unweighted mean) <inline-formula id="j_infor457_ineq_033"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, <italic>Precision</italic> and <inline-formula id="j_infor457_ineq_034"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula>, unless specified otherwise.</p>
</sec>
</sec>
<sec id="j_infor457_s_026">
<label>3.7</label>
<title>Bias and Variance Decomposition</title>
<p>Decomposing the loss into bias and variance improves understanding of the generalization behaviour of the compared learning algorithms, in particular overfitting and underfitting. Various decomposition methods are reviewed in Domingos (<xref ref-type="bibr" rid="j_infor457_ref_012">2000</xref>). It has been demonstrated that high variance is associated with overfitting, and high bias with underfitting. In practical terms, when comparing the performance of learning algorithms, models with lower bias and variance over the same test data are preferred. It is worth noting that models with more degrees of freedom in their parameters tend to exhibit lower bias and higher variance, whereas models with few degrees of freedom exhibit higher bias and lower variance.</p>
<p>The loss of a learning algorithm can be decomposed into three terms: a variance term, a bias term, and a noise term, the last of which is ignored hereafter for simplicity (Raschka, <xref ref-type="bibr" rid="j_infor457_ref_050">2018</xref>). The loss function depends on the machine learning algorithm. For decision trees (<italic>CART</italic>), training proceeds through a greedy search, with each split chosen by information gain. For the random forest classifier, the loss function is the <italic>Gini impurity</italic>. <italic>Cross-entropy</italic> is the default loss function for multi-class classification problems with <italic>MLP</italic>.</p>
<p>The prediction bias is calculated as the difference between the expected prediction accuracy of a model and the true prediction accuracy (equation (<xref rid="j_infor457_eq_007">7</xref>)). In formal notation the <inline-formula id="j_infor457_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">B</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi></mml:math><tex-math><![CDATA[$Bias$]]></tex-math></alternatives></inline-formula> of an estimator <inline-formula id="j_infor457_ineq_036"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\beta }$]]></tex-math></alternatives></inline-formula> is the difference between its expected value <inline-formula id="j_infor457_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$E[\hat{\beta }]$]]></tex-math></alternatives></inline-formula> and the true value of a parameter <italic>β</italic> being estimated (Raschka, <xref ref-type="bibr" rid="j_infor457_ref_050">2018</xref>): 
<disp-formula id="j_infor457_eq_007">
<label>(7)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">Bias</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">β</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{Bias}=E[\hat{\beta }]-\beta .\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The variance (equation (<xref rid="j_infor457_eq_008">8</xref>)) is a measure of the variability of the model’s predictions when the learning process is repeated multiple times with random fluctuations in the training set. 
<disp-formula id="j_infor457_eq_008">
<label>(8)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mtext mathvariant="italic">Variance</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">[</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">(</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo fence="true" stretchy="false">]</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="1.19em" minsize="1.19em">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">]</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \textit{Variance}=E\big[{\big(\hat{\beta }-E[\hat{\beta }]\big)^{2}}\big].\]]]></tex-math></alternatives>
</disp-formula> 
<italic>Variance</italic> is obtained by repeating prediction on a model trained on stratified shuffle-split training data. The more sensitive the model-building process is towards fluctuations of the training data, the higher the variance (Raschka, <xref ref-type="bibr" rid="j_infor457_ref_050">2018</xref>).</p>
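<p>Equations (7) and (8) can be illustrated with a small simulation in which the training set is redrawn many times and the estimator is a simple sample mean; the true parameter value and noise level below are hypothetical:</p>

```python
import random

random.seed(0)
beta_true = 5.0           # hypothetical true parameter beta
estimates = []            # collected beta_hat over repeated "training sets"
for _ in range(2000):
    sample = [random.gauss(beta_true, 2.0) for _ in range(30)]
    estimates.append(sum(sample) / len(sample))   # beta_hat: sample mean

e_beta = sum(estimates) / len(estimates)          # E[beta_hat]
bias = e_beta - beta_true                                              # eq. (7)
variance = sum((b - e_beta) ** 2 for b in estimates) / len(estimates)  # eq. (8)
print(bias, variance)  # bias near 0; variance near 2.0**2 / 30
```

The sample mean is unbiased, so the estimated bias hovers around zero, while the variance approaches the theoretical value of the noise variance divided by the sample size.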
</sec>
<sec id="j_infor457_s_027">
<label>3.8</label>
<title>Tree Pruning</title>
<p>Finding the values where the training and testing learning curves converge allows building decision trees that generalize better, reducing both overfitting and underfitting. The tree depth (implemented in the <italic>Scikit-learn</italic> library through parameter <inline-formula id="j_infor457_ineq_038"><alternatives><mml:math>
<mml:mtext mathvariant="italic">max</mml:mtext>
<mml:mtext>_</mml:mtext>
<mml:mtext mathvariant="italic">depth</mml:mtext></mml:math><tex-math><![CDATA[$\textit{max}\text{\_}\textit{depth}$]]></tex-math></alternatives></inline-formula>) and <italic>α</italic> (implemented in <italic>Scikit-learn</italic> library through parameter <inline-formula id="j_infor457_ineq_039"><alternatives><mml:math>
<mml:mtext mathvariant="italic">ccp</mml:mtext>
<mml:mtext>_</mml:mtext>
<mml:mtext mathvariant="italic">alpha</mml:mtext></mml:math><tex-math><![CDATA[$\textit{ccp}\text{\_}\textit{alpha}$]]></tex-math></alternatives></inline-formula>) were obtained using minimal cost-complexity pruning (Breiman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>), implemented in the <italic>Scikit-learn</italic> library as the <italic>cost_complexity_pruning_path</italic> function, while searching for a minimum of bias and variance. In this algorithm the cost-complexity measure <inline-formula id="j_infor457_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${R_{\alpha }}(T)$]]></tex-math></alternatives></inline-formula> of a given tree <italic>T</italic> is defined in formula (<xref rid="j_infor457_eq_009">9</xref>) as follows: 
<disp-formula id="j_infor457_eq_009">
<label>(9)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">R</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo stretchy="false">|</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">˜</mml:mo></mml:mover>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {R_{\alpha }}(T)=R(T)+\alpha |\tilde{T}|,\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_infor457_ineq_041"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">˜</mml:mo></mml:mover>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|\tilde{T}|$]]></tex-math></alternatives></inline-formula> is the number of terminal nodes in <italic>T</italic>, <inline-formula id="j_infor457_ineq_042"><alternatives><mml:math>
<mml:mi mathvariant="italic">R</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">T</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$R(T)$]]></tex-math></alternatives></inline-formula> is the total misclassification cost of the terminal nodes, and <italic>α</italic> (⩾0) is the complexity parameter. As <italic>α</italic> increases, more nodes are pruned.</p>
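<p>A toy illustration of formula (9): for two hypothetical candidate subtrees, increasing <italic>α</italic> shifts the preference from the larger, better-fitting tree towards the pruned one (in practice the <italic>Scikit-learn</italic> pruning-path function computes the effective <italic>α</italic> values automatically):</p>

```python
def cost_complexity(r_t, n_leaves, alpha):
    # Formula (9): R_alpha(T) = R(T) + alpha * |T~|, where r_t is the
    # misclassification cost of the leaves and n_leaves their count.
    return r_t + alpha * n_leaves

# Hypothetical candidate subtrees: (R(T), |T~|).
big_tree, pruned_tree = (0.02, 15), (0.05, 4)

# With alpha = 0 the larger tree has lower cost; raising alpha penalizes
# leaves until the pruned tree becomes preferable.
print(cost_complexity(*big_tree, 0.0), cost_complexity(*pruned_tree, 0.0))
print(cost_complexity(*big_tree, 0.005), cost_complexity(*pruned_tree, 0.005))
```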
</sec>
<sec id="j_infor457_s_028">
<label>3.9</label>
<title>Variance Inflation Factor</title>
<p>Many variables in the CIC-IDS2017 and CSE-CIC-IDS2018 datasets appear to be correlated with each other, which increases bias when using <italic>Quadratic Discriminant Analysis</italic>. The statistical measure <italic>VIF</italic> (Variance Inflation Factor) was proposed by Lin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_041">2011</xref>) to support the elimination of cross-correlated features; in this research it is used as implemented in the <italic>statsmodels</italic> library (Seabold and Perktold, <xref ref-type="bibr" rid="j_infor457_ref_056">2010</xref>).</p>
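<p>For a single explanatory feature, VIF reduces to 1/(1 − r²), where r is the Pearson correlation between the two features; the self-contained sketch below (with hypothetical data) illustrates this special case, which the <italic>statsmodels</italic> <italic>variance_inflation_factor</italic> function generalizes by regressing each feature on all the others:</p>

```python
def vif_two_features(x, y):
    # With one regressor, R^2 equals the squared Pearson correlation,
    # so VIF = 1 / (1 - r^2). A VIF near 1 means no collinearity;
    # large values flag features that are nearly linear combinations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy * sxy / (sxx * syy)
    return 1.0 / (1.0 - r2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0]   # nearly collinear with x -> very large VIF
print(vif_two_features(x, y))
```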
</sec>
<sec id="j_infor457_s_029">
<label>3.10</label>
<title>Other Methods</title>
<p>The number of estimators was obtained using <italic>Scikit-learn</italic>’s <italic>GridSearch</italic> (LaValle <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_037">2004</xref>) method. See Sections <xref rid="j_infor457_s_034">4.4</xref>–<xref rid="j_infor457_s_035">4.5</xref> and Table <xref rid="j_infor457_tab_015">15</xref> for implementation details in this research.</p>
</sec>
</sec>
<sec id="j_infor457_s_030">
<label>4</label>
<title>Experiment Design</title>
<p>Our experiment comprised pre-processing, described in detail in Section <xref rid="j_infor457_s_031">4.1</xref> for the CIC-IDS2017 dataset, Section <xref rid="j_infor457_s_032">4.2</xref> for the CSE-CIC-IDS2018 dataset and Section <xref rid="j_infor457_s_033">4.3</xref> for the LITNET-2020 dataset. The datasets were cleaned and normalized. A <italic>quantile</italic> transformation from the <italic>Scikit-learn</italic> library (Pedregosa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_048">2011</xref>), using <italic>QuantileTransformer</italic> with its default of 1 000 quantiles, was applied in the pre-processing of the numeric (continuous, time-related) features of all datasets in order to transform the original values to a more uniform distribution.</p>
<p>The datasets were further under-sampled with random fixed-ratio under-sampling and the proposed skewed fixed-ratio under-sampling so that, after splitting into testing and training sets, each set would contain approximately 600 000 records or more, which is sufficient for learning of all the algorithms. This number was estimated by learning curve analysis.</p>
<p>Later on, the training subsets were over-sampled using SMOTE for the CIC-IDS2017 and CSE-CIC-IDS2018 datasets and SMOTE-NC for LITNET-2020. Features were selected using <italic>KBest</italic> (see Section <xref rid="j_infor457_s_014">3.3</xref>) and <italic>VIF</italic> procedures (see Section <xref rid="j_infor457_s_028">3.9</xref>). Training and hyper-parameter search were performed using cross-validation with <inline-formula id="j_infor457_ineq_043"><alternatives><mml:math>
<mml:mtext mathvariant="italic">CV</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>20</mml:mn></mml:math><tex-math><![CDATA[$\textit{CV}=20$]]></tex-math></alternatives></inline-formula> on stratified shuffle split samples of training datasets.</p>
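<p>The stratified shuffle split can be sketched as follows: record indices are grouped per class, each group is shuffled, and the same fraction is drawn from every class, so each split preserves the class proportions. This is a simplified stand-in for <italic>Scikit-learn</italic>’s <italic>StratifiedShuffleSplit</italic>, and the labels below are hypothetical:</p>

```python
import random
from collections import defaultdict

def stratified_shuffle_split(labels, test_frac=0.25, seed=0):
    # Group record indices per class, shuffle each group, then take the
    # same fraction from every class so the split keeps class proportions.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = max(1, round(len(idxs) * test_frac))
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test

labels = ["benign"] * 80 + ["attack"] * 20   # hypothetical imbalanced labels
train, test = stratified_shuffle_split(labels)
print(len(train), len(test))  # 75 25
```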
<p>The final prediction results were obtained on the testing data, i.e. data not seen by the trained models. In order to obtain a reliable result, predictions were run 30 times, changing the random seed on each run.</p>
<p>Further on in the experiment, the best features were selected using the <italic>SelectKBest</italic> procedure from the <italic>Scikit-learn</italic> library (Pedregosa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_048">2011</xref>), followed by variance inflation factor analysis (Lin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_041">2011</xref>) with a target threshold value to eliminate variables with high collinearity.</p>
<p>Parameters for classification models were searched using <italic>GridSearch</italic> from the <italic>Scikit-learn</italic> library.</p>
<sec id="j_infor457_s_031">
<label>4.1</label>
<title>CIC-IDS2017 Pre-Processing Steps</title>
<p>The following procedures were implemented to condition the dataset for better learning of under-represented attack classes: a) removal of unused features and the related duplicate records; b) random under-sampling of benign-class records, down to no more than the number of records that provides sufficient learning for the worst-performing model, obtained from the analysis of learning curves; and c) over-sampling of the training sub-sample of extremely rare classes (see Table <xref rid="j_infor457_tab_004">4</xref>) using SMOTE, up to the minimum number of examples among the highly imbalanced classes.</p>
<p>Duplicate rows were removed (leaving the first one), see Table <xref rid="j_infor457_tab_008">8</xref>.</p>
<table-wrap id="j_infor457_tab_008">
<label>Table 8</label>
<caption>
<p>Removal of duplicates in IDS2017 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Share of removed records (%)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting counts<sup>1</sup></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">7.770%</td>
<td style="vertical-align: top; text-align: left">2 096 484</td>
<td style="vertical-align: top; text-align: left">83.1159%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Hulk</td>
<td style="vertical-align: top; text-align: left">25.197%</td>
<td style="vertical-align: top; text-align: left">172 849</td>
<td style="vertical-align: top; text-align: left">6.8527%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PortScan</td>
<td style="vertical-align: top; text-align: left">42.856%</td>
<td style="vertical-align: top; text-align: left">90 819</td>
<td style="vertical-align: top; text-align: left">3.6006%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDoS</td>
<td style="vertical-align: top; text-align: left">0.009%</td>
<td style="vertical-align: top; text-align: left">128 016</td>
<td style="vertical-align: top; text-align: left">5.0752%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS GoldenEye</td>
<td style="vertical-align: top; text-align: left">0.068%</td>
<td style="vertical-align: top; text-align: left">10 286</td>
<td style="vertical-align: top; text-align: left">0.4078%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-Patator</td>
<td style="vertical-align: top; text-align: left">25.258%</td>
<td style="vertical-align: top; text-align: left">5 933</td>
<td style="vertical-align: top; text-align: left">0.2352%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SSH-Patator</td>
<td style="vertical-align: top; text-align: left">45.413%</td>
<td style="vertical-align: top; text-align: left">3 219</td>
<td style="vertical-align: top; text-align: left">0.1276%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS slowloris</td>
<td style="vertical-align: top; text-align: left">7.091%</td>
<td style="vertical-align: top; text-align: left">5 385</td>
<td style="vertical-align: top; text-align: left">0.2135%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Slowhttptest</td>
<td style="vertical-align: top; text-align: left">4.928%</td>
<td style="vertical-align: top; text-align: left">5 228</td>
<td style="vertical-align: top; text-align: left">0.2073%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">0.661%</td>
<td style="vertical-align: top; text-align: left">1 953</td>
<td style="vertical-align: top; text-align: left">0.0774%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack – Brute Force</td>
<td style="vertical-align: top; text-align: left">2.455%</td>
<td style="vertical-align: top; text-align: left">1 470</td>
<td style="vertical-align: top; text-align: left">0.0583%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-XSS</td>
<td style="vertical-align: top; text-align: left">0.000%</td>
<td style="vertical-align: top; text-align: left">652</td>
<td style="vertical-align: top; text-align: left">0.0258%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infiltration</td>
<td style="vertical-align: top; text-align: left">0.000%</td>
<td style="vertical-align: top; text-align: left">36</td>
<td style="vertical-align: top; text-align: left">0.0014%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-Sql Injection</td>
<td style="vertical-align: top; text-align: left">0.000%</td>
<td style="vertical-align: top; text-align: left">21</td>
<td style="vertical-align: top; text-align: left">0.0008%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Heartbleed</td>
<td style="vertical-align: top; text-align: left">0.000%</td>
<td style="vertical-align: top; text-align: left">11</td>
<td style="vertical-align: top; text-align: left">0.0004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2 522 362</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Record counts after removing duplicate records.</p>
</table-wrap-foot>
</table-wrap>
<p>The following 8 features ‘<italic>Bwd PSH Flags</italic>’, ‘<italic>Bwd URG Flags</italic>’, ‘<italic>Fwd Avg Bytes/Bulk</italic>’, ‘<italic>Fwd Avg Packets/Bulk</italic>’, ‘<italic>Fwd Avg Bulk Rate</italic>’, ‘<italic>Bwd Avg Bytes/Bulk</italic>’, ‘<italic>Bwd Avg Packets/Bulk</italic>’, ‘<italic>Bwd Avg Bulk Rate</italic>’, containing no information <inline-formula id="j_infor457_ineq_044"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mtext mathvariant="italic">Std</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\textit{Std}=0)$]]></tex-math></alternatives></inline-formula> in all loaded files and duplicate feature ‘<italic>Fwd Header Length.1</italic>’ <inline-formula id="j_infor457_ineq_045"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mtext mathvariant="italic">corr</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\textit{corr}=1)$]]></tex-math></alternatives></inline-formula> with ‘<italic>Fwd Header Length</italic>’ were removed.</p>
<p>After dropping the duplicates, the 2 522 362 remaining records were investigated for missing values and infinities.</p>
<p>As a result, 1 358 records containing missing values were removed by the duplicate-removal step. The remaining 353 rows with missing values were found to be split between the ‘<italic>Benign</italic>’ (350) and ‘<italic>DoS Hulk</italic>’ (3) classes, and their missing values were replaced with −1.</p>
<p>Further, 1 211 records with infinities in the two features ‘<italic>Flow Bytes/s</italic>’ and ‘<italic>Flow Packets/s</italic>’ were found, and the infinities were replaced by the per-class maxima of the corresponding values, see Table <xref rid="j_infor457_tab_009">9</xref>.</p>
<table-wrap id="j_infor457_tab_009">
<label>Table 9</label>
<caption>
<p>Replacing infinities in IDS2017 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record count</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Flow Bytes/s</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Flow Packets/s</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">1 077</td>
<td style="vertical-align: top; text-align: left">2.071e+09</td>
<td style="vertical-align: top; text-align: left">4.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PortScan</td>
<td style="vertical-align: top; text-align: left">125</td>
<td style="vertical-align: top; text-align: left">8.00e+06</td>
<td style="vertical-align: top; text-align: left">2.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">1.20e+07</td>
<td style="vertical-align: top; text-align: left">2.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-Patator</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1.40e+07</td>
<td style="vertical-align: top; text-align: left">3.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDoS</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">3.47e+08</td>
<td style="vertical-align: top; text-align: left">2.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total:</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1211</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>This processing step assumes that such a replacement of lost values could be implemented after the values are learned during the initial training of a real-life intrusion detection system.</p>
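<p>The replacement of infinities by per-class maxima (Table <xref rid="j_infor457_tab_009">9</xref>) can be sketched as follows; the feature values and class labels used here are hypothetical:</p>

```python
import math

def replace_inf_with_class_max(values, labels):
    # Per class, find the maximum finite value, then substitute it for
    # any infinite entry belonging to that class.
    finite_max = {}
    for v, lab in zip(values, labels):
        if math.isfinite(v):
            finite_max[lab] = max(v, finite_max.get(lab, -math.inf))
    return [finite_max[lab] if math.isinf(v) else v
            for v, lab in zip(values, labels)]

flow_bytes = [10.0, math.inf, 3.0, math.inf]          # hypothetical feature
classes = ["Benign", "Benign", "PortScan", "PortScan"]
print(replace_inf_with_class_max(flow_bytes, classes))  # [10.0, 10.0, 3.0, 3.0]
```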
<p>Next, the record counts of the <italic>Benign</italic> class and of the second-largest class <italic>DoS Hulk</italic> were reduced with skewed fixed-ratio under-sampling. The remaining data was split into test and train sub-samples. The training subset was then over-sampled with SMOTE (hence the training record counts of 4 999 and 2 999 in Table <xref rid="j_infor457_tab_010">10</xref>). This procedure keeps all records of the extremely imbalanced classes (Table <xref rid="j_infor457_tab_004">4</xref>) intact and adds new records for training, resulting in the record counts for the training and testing samples presented in Table <xref rid="j_infor457_tab_010">10</xref>.</p>
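<p>The interpolation idea behind SMOTE can be sketched as follows: each synthetic record is placed on the segment between a minority-class sample and one of its nearest minority neighbours. This is a simplified stand-in for the full library implementation, and the sample points below are hypothetical:</p>

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    # SMOTE idea: synthesize a point at a random position on the segment
    # between a minority sample and one of its k nearest minority neighbours.
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        out.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return out

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # hypothetical rare class
synthetic = smote_like_oversample(minority, n_new=5)
print(synthetic)
```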
<p>After this, the values of numeric columns were scaled to a range of <inline-formula id="j_infor457_ineq_046"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[0;1]$]]></tex-math></alternatives></inline-formula> with <italic>Scikit-learn</italic> (Pedregosa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_048">2011</xref>) <italic>QuantileTransformer</italic>. This transformation maps each feature individually onto its quantiles, estimated on the training set, and scales the result to the given range, by default between zero and one.</p>
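<p>A sketch of this scaling step with <italic>Scikit-learn</italic> on synthetic heavy-tailed data (shapes and values are illustrative); the transformer is fitted on the training set only and then reused for the test set:</p>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Heavy-tailed toy features standing in for flow statistics.
rng = np.random.default_rng(0)
X_train = rng.exponential(size=(500, 3))
X_test = rng.exponential(size=(200, 3))

# Fit the quantile mapping on the training set only, then reuse it;
# transformed values fall in the range [0, 1].
qt = QuantileTransformer(n_quantiles=500, output_distribution="uniform",
                         random_state=0)
X_train_s = qt.fit_transform(X_train)
X_test_s = qt.transform(X_test)
print(X_train_s.min(), X_train_s.max())
```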
<p>Further in this research, the 40 best features were selected using the <italic>SelectKBest</italic> procedure from the <italic>Scikit-learn</italic> library (Pedregosa <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_048">2011</xref>), followed by a <italic>variance inflation factor</italic> analysis with a threshold value of 40 to eliminate variables with high collinearity.</p>
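<p>This two-stage feature selection could be sketched as follows; the toy data, the <italic>f_classif</italic> scoring function and the <italic>statsmodels</italic> VIF routine are assumptions for illustration, and the drop-the-worst loop is one possible realization of the VIF filtering:</p>

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy data: 10 features, one deliberately collinear pair (f0, f9).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 10)),
                 columns=[f"f{i}" for i in range(10)])
X["f9"] = X["f0"] * 0.9 + rng.normal(scale=0.1, size=300)
y = (X["f0"] + X["f9"] > 0).astype(int)

# Stage 1: keep the k best features by univariate F-score.
selector = SelectKBest(f_classif, k=5).fit(X, y)
Xk = X[X.columns[selector.get_support()]]

# Stage 2: iteratively drop the feature with the largest VIF
# until every remaining VIF is at or below the threshold.
threshold = 40
while True:
    vifs = [variance_inflation_factor(Xk.values, i)
            for i in range(Xk.shape[1])]
    if max(vifs) > threshold:
        Xk = Xk.drop(columns=[Xk.columns[int(np.argmax(vifs))]])
    else:
        break
```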
<table-wrap id="j_infor457_tab_010">
<label>Table 10</label>
<caption>
<p>Resulting IDS2017 dataset training and/or validation sample representation.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record label</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Training records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Testing records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">442 421</td>
<td style="vertical-align: top; text-align: left">64.739%</td>
<td style="vertical-align: top; text-align: left">442 421</td>
<td style="vertical-align: top; text-align: left">67.508%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Hulk</td>
<td style="vertical-align: top; text-align: left">86 425</td>
<td style="vertical-align: top; text-align: left">12.646%</td>
<td style="vertical-align: top; text-align: left">86 424</td>
<td style="vertical-align: top; text-align: left">13.187%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDoS</td>
<td style="vertical-align: top; text-align: left">64 008</td>
<td style="vertical-align: top; text-align: left">9.366%</td>
<td style="vertical-align: top; text-align: left">64 008</td>
<td style="vertical-align: top; text-align: left">9.767%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PortScan</td>
<td style="vertical-align: top; text-align: left">45 410</td>
<td style="vertical-align: top; text-align: left">6.645%</td>
<td style="vertical-align: top; text-align: left">45 409</td>
<td style="vertical-align: top; text-align: left">6.929%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS GoldenEye</td>
<td style="vertical-align: top; text-align: left">5 143</td>
<td style="vertical-align: top; text-align: left">0.753%</td>
<td style="vertical-align: top; text-align: left">5 143</td>
<td style="vertical-align: top; text-align: left">0.785%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-Patator</td>
<td style="vertical-align: top; text-align: left">4 999</td>
<td style="vertical-align: top; text-align: left">0.731%</td>
<td style="vertical-align: top; text-align: left">2 967</td>
<td style="vertical-align: top; text-align: left">0.453%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS slowloris</td>
<td style="vertical-align: top; text-align: left">4 999</td>
<td style="vertical-align: top; text-align: left">0.731%</td>
<td style="vertical-align: top; text-align: left">2 692</td>
<td style="vertical-align: top; text-align: left">0.411%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS Slowhttptest</td>
<td style="vertical-align: top; text-align: left">4 999</td>
<td style="vertical-align: top; text-align: left">0.731%</td>
<td style="vertical-align: top; text-align: left">2 614</td>
<td style="vertical-align: top; text-align: left">0.399%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SSH-Patator</td>
<td style="vertical-align: top; text-align: left">4 999</td>
<td style="vertical-align: top; text-align: left">0.731%</td>
<td style="vertical-align: top; text-align: left">1 610</td>
<td style="vertical-align: top; text-align: left">0.246%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">4 999</td>
<td style="vertical-align: top; text-align: left">0.731%</td>
<td style="vertical-align: top; text-align: left">976</td>
<td style="vertical-align: top; text-align: left">0.149%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-Brute Force</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.439%</td>
<td style="vertical-align: top; text-align: left">735</td>
<td style="vertical-align: top; text-align: left">0.112%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-XSS</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.439%</td>
<td style="vertical-align: top; text-align: left">326</td>
<td style="vertical-align: top; text-align: left">0.050%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infiltration</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.439%</td>
<td style="vertical-align: top; text-align: left">18</td>
<td style="vertical-align: top; text-align: left">0.003%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Web Attack-Sql Injection</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.439%</td>
<td style="vertical-align: top; text-align: left">11</td>
<td style="vertical-align: top; text-align: left">0.002%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Heartbleed</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.439%</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">0.001%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total:</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">683 397</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">655 360</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor457_s_032">
<label>4.2</label>
<title>CIC-IDS2018 Pre-Processing Steps</title>
<p>The same pre-processing procedure from Section <xref rid="j_infor457_s_031">4.1</xref> was applied to the CIC-IDS2018 dataset.</p>
<p>The timestamp column and the resulting duplicate records were removed, as no time-series-dependent machine learning methods were chosen in this research.</p>
<p>Afterwards, 8 features ‘<italic>Bwd URG Flags</italic>’, ‘<italic>Bwd Pkts/b Avg</italic>’, ‘<italic>Bwd PSH Flags</italic>’, ‘<italic>Bwd Blk Rate Avg</italic>’, ‘<italic>Fwd Byts/b Avg</italic>’, ‘<italic>Fwd Pkts/b Avg</italic>’, ‘<italic>Fwd Blk Rate Avg</italic>’, ‘<italic>Bwd Byts/b Avg</italic>’ containing no information (i.e., <inline-formula id="j_infor457_ineq_047"><alternatives><mml:math>
<mml:mtext mathvariant="italic">Std</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn></mml:math><tex-math><![CDATA[$\textit{Std}=0$]]></tex-math></alternatives></inline-formula>) were removed.</p>
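<p>Dropping such zero-variance columns is a short operation in <italic>pandas</italic>; the column names and values below are illustrative:</p>

```python
import pandas as pd

# Columns with a standard deviation of 0 carry no information.
df = pd.DataFrame({
    "Flow Duration": [5, 120, 43],
    "Bwd URG Flags": [0, 0, 0],       # constant column
    "Fwd Pkts/b Avg": [0.0, 0.0, 0.0],  # constant column
})
constant = df.columns[df.std() == 0]
df = df.drop(columns=constant)
print(list(df.columns))
```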
<p>The following sampling procedures were executed in order to achieve a better balance between major classes and extremely rare classes:</p>
<list>
<list-item id="j_infor457_li_022">
<label>1.</label>
<p>the top two classes (‘<italic>Benign</italic>’ and ‘<italic>DDoS attacks-LOIC-HTTP</italic>’) were under-sampled to a record count that still provides sufficient learning for the worst performing model, as determined from the analysis of learning curves.</p>
</list-item>
<list-item id="j_infor457_li_023">
<label>2.</label>
<p>The remaining data was split into test and train sub-samples.</p>
</list-item>
<list-item id="j_infor457_li_024">
<label>3.</label>
<p>The training sub-set was then over-sampled with SMOTE (hence the training record count value of 2 999). This procedure keeps all extremely imbalanced class records (Table <xref rid="j_infor457_tab_004">4</xref>) intact and adds new records for training only, resulting in the record counts for the training and testing samples presented in Table <xref rid="j_infor457_tab_011">11</xref>.</p>
</list-item>
</list>
<table-wrap id="j_infor457_tab_011">
<label>Table 11</label>
<caption>
<p>Resulting IDS2018 dataset training and validation sample representation.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record label</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Training records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Testing records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">134 850</td>
<td style="vertical-align: top; text-align: left">20.067%</td>
<td style="vertical-align: top; text-align: left">134 849</td>
<td style="vertical-align: top; text-align: left">20.576%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDoS attacks-LOIC-HTTP</td>
<td style="vertical-align: top; text-align: left">129 558</td>
<td style="vertical-align: top; text-align: left">19.280%</td>
<td style="vertical-align: top; text-align: left">129 558</td>
<td style="vertical-align: top; text-align: left">19.769%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDOS attack-HOIC</td>
<td style="vertical-align: top; text-align: left">99 430</td>
<td style="vertical-align: top; text-align: left">14.796%</td>
<td style="vertical-align: top; text-align: left">99 431</td>
<td style="vertical-align: top; text-align: left">15.172%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infilteration</td>
<td style="vertical-align: top; text-align: left">72 612</td>
<td style="vertical-align: top; text-align: left">10.805%</td>
<td style="vertical-align: top; text-align: left">72 613</td>
<td style="vertical-align: top; text-align: left">11.080%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS attacks-Hulk</td>
<td style="vertical-align: top; text-align: left">72 599</td>
<td style="vertical-align: top; text-align: left">10.804%</td>
<td style="vertical-align: top; text-align: left">72 600</td>
<td style="vertical-align: top; text-align: left">11.078%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Bot</td>
<td style="vertical-align: top; text-align: left">72 268</td>
<td style="vertical-align: top; text-align: left">10.754%</td>
<td style="vertical-align: top; text-align: left">72 267</td>
<td style="vertical-align: top; text-align: left">11.027%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SSH-Bruteforce</td>
<td style="vertical-align: top; text-align: left">47 024</td>
<td style="vertical-align: top; text-align: left">6.998%</td>
<td style="vertical-align: top; text-align: left">47 024</td>
<td style="vertical-align: top; text-align: left">7.175%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS attacks-GoldenEye</td>
<td style="vertical-align: top; text-align: left">20 703</td>
<td style="vertical-align: top; text-align: left">3.081%</td>
<td style="vertical-align: top; text-align: left">20 703</td>
<td style="vertical-align: top; text-align: left">3.159%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS attacks-Slowloris</td>
<td style="vertical-align: top; text-align: left">4 954</td>
<td style="vertical-align: top; text-align: left">0.737%</td>
<td style="vertical-align: top; text-align: left">4 954</td>
<td style="vertical-align: top; text-align: left">0.756%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DDOS attack-LOIC-UDP</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">865</td>
<td style="vertical-align: top; text-align: left">0.132%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-Web</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">285</td>
<td style="vertical-align: top; text-align: left">0.043%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Brute Force-XSS</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">114</td>
<td style="vertical-align: top; text-align: left">0.017%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SQL Injection</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">43</td>
<td style="vertical-align: top; text-align: left">0.007%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-BruteForce</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">27</td>
<td style="vertical-align: top; text-align: left">0.004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DoS attacks-SlowHTTPTest</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.446%</td>
<td style="vertical-align: top; text-align: left">27</td>
<td style="vertical-align: top; text-align: left">0.004%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total:</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">671 992</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">655 360</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>It should be noted that 7 373 records with infinities in two features ‘<italic>Flow Bytes/s</italic>’ and ‘<italic>Flow Packets/s</italic>’ were found and replaced by maximums of values per class, see Table <xref rid="j_infor457_tab_012">12</xref>.</p>
<table-wrap id="j_infor457_tab_012">
<label>Table 12</label>
<caption>
<p>Replacing infinities in IDS2018 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record count</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Flow Bytes/s</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Flow Packets/s</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">6 243</td>
<td style="vertical-align: top; text-align: left">1.47e+09</td>
<td style="vertical-align: top; text-align: left">4.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Infilteration</td>
<td style="vertical-align: top; text-align: left">1 129</td>
<td style="vertical-align: top; text-align: left">2.74e+08</td>
<td style="vertical-align: top; text-align: left">3.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FTP-BruteForce</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.0e+00</td>
<td style="vertical-align: top; text-align: left">2.0e+06</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total:</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7 373</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>The presence of such values may indicate that the related flows had not yet been terminated at the time of recording.</p>
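<p>A sketch of the per-class replacement of infinities with <italic>pandas</italic>, on a toy stand-in for the ‘<italic>Flow Bytes/s</italic>’ column (values illustrative):</p>

```python
import numpy as np
import pandas as pd

# Two classes; rows 1 and 3 contain +inf to be replaced.
df = pd.DataFrame({
    "Label": ["Benign", "Benign", "Benign",
              "Infilteration", "Infilteration"],
    "Flow Bytes/s": [10.0, np.inf, 30.0, np.inf, 7.0],
})

col = "Flow Bytes/s"
# Maximum of the finite values, computed per class.
finite_max = df[np.isfinite(df[col])].groupby("Label")[col].max()
# Replace each infinity by the maximum of its own class.
mask = np.isinf(df[col])
df.loc[mask, col] = df.loc[mask, "Label"].map(finite_max)
print(df[col].tolist())  # [10.0, 30.0, 30.0, 7.0, 7.0]
```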
<p>After the data cleaning, the dataset was normalized with <italic>QuantileTransformer</italic>. The 40 best features from <italic>SelectKBest</italic> were then passed through the <italic>variance inflation factor</italic> procedure with a threshold of 40, selected to eliminate feature collinearity.</p>
</sec>
<sec id="j_infor457_s_033">
<label>4.3</label>
<title>LITNET-2020 Dataset Pre-Processing</title>
<p>Due to the choice of supervised machine learning models and the problem definition in this study, the LITNET-2020 dataset <italic>timestamp</italic> feature was not used. Features related to the source and destination address, such as the source and destination issuing authorities, strongly aid in identifying not only the attacker but also the attack class; therefore, in order to support generalization of training, they were eliminated.</p>
<p>After removing the <italic>timestamp</italic> and address-related features, the resulting duplicate records were also removed, see Table <xref rid="j_infor457_tab_013">13</xref>.</p>
<table-wrap id="j_infor457_tab_013">
<label>Table 13</label>
<caption>
<p>Removal of timestamp related duplicates in LITNET-2020 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Traffic type</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Share of removed records (%)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting counts of records<sup>1</sup></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">33.1%</td>
<td style="vertical-align: top; text-align: left">24 349 750</td>
<td style="vertical-align: top; text-align: left">95.052%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SYN Flood</td>
<td style="vertical-align: top; text-align: left">98.2%</td>
<td style="vertical-align: top; text-align: left">28 873</td>
<td style="vertical-align: top; text-align: left">0.113%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Code Red</td>
<td style="vertical-align: top; text-align: left">13.5%</td>
<td style="vertical-align: top; text-align: left">1 085 656</td>
<td style="vertical-align: top; text-align: left">4.238%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Smurf</td>
<td style="vertical-align: top; text-align: left">87.7%</td>
<td style="vertical-align: top; text-align: left">14 642</td>
<td style="vertical-align: top; text-align: left">0.057%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">UDP Flood</td>
<td style="vertical-align: top; text-align: left">1.3%</td>
<td style="vertical-align: top; text-align: left">92 412</td>
<td style="vertical-align: top; text-align: left">0.361%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LAND DoS</td>
<td style="vertical-align: top; text-align: left">75.3%</td>
<td style="vertical-align: top; text-align: left">12 926</td>
<td style="vertical-align: top; text-align: left">0.050%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">W32.Blaster</td>
<td style="vertical-align: top; text-align: left">99.2%</td>
<td style="vertical-align: top; text-align: left">200</td>
<td style="vertical-align: top; text-align: left">0.001%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ICMP Flood</td>
<td style="vertical-align: top; text-align: left">92.6%</td>
<td style="vertical-align: top; text-align: left">1 723</td>
<td style="vertical-align: top; text-align: left">0.007%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">HTTP Flood</td>
<td style="vertical-align: top; text-align: left">1.7%</td>
<td style="vertical-align: top; text-align: left">22 578</td>
<td style="vertical-align: top; text-align: left">0.088%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Scan</td>
<td style="vertical-align: top; text-align: left">0.0%</td>
<td style="vertical-align: top; text-align: left">6 232</td>
<td style="vertical-align: top; text-align: left">0.024%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Reaper Worm</td>
<td style="vertical-align: top; text-align: left">0.3%</td>
<td style="vertical-align: top; text-align: left">1 173</td>
<td style="vertical-align: top; text-align: left">0.005%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">0.1%</td>
<td style="vertical-align: top; text-align: left">746</td>
<td style="vertical-align: top; text-align: left">0.003%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Fragmentation</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">15.9%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">401</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.002%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Record counts after removing timestamp and related record duplicates.</p>
</table-wrap-foot>
</table-wrap>
<p>The resulting dataset is even more imbalanced. The target record counts for the <italic>Benign</italic> and <italic>Code Red</italic> classes were set based on learning curves that indicate the number of records required by the worst performing model for sufficient learning. Sufficient learning is defined here as the learning and testing curves converging within a margin of less than <inline-formula id="j_infor457_ineq_048"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$1\% $]]></tex-math></alternatives></inline-formula>, which for all models under experiment occurs after approximately 0.5 million records. The dataset was then split in half into testing and validation sub-samples.</p>
<p>As a final step, the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC), designed for datasets with categorical features and introduced by Chawla <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_007">2002</xref>), was applied, see Table <xref rid="j_infor457_tab_014">14</xref>.</p>
<table-wrap id="j_infor457_tab_014">
<label>Table 14</label>
<caption>
<p>LITNET-2020 dataset sample representation.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Record label</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Training records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Testing records</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Resulting share (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Benign</td>
<td style="vertical-align: top; text-align: left">349 470</td>
<td style="vertical-align: top; text-align: left">51.277%</td>
<td style="vertical-align: top; text-align: left">349 470</td>
<td style="vertical-align: top; text-align: left">53.325%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Code Red</td>
<td style="vertical-align: top; text-align: left">215 484</td>
<td style="vertical-align: top; text-align: left">31.618%</td>
<td style="vertical-align: top; text-align: left">215 485</td>
<td style="vertical-align: top; text-align: left">32.880%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">UDP Flood</td>
<td style="vertical-align: top; text-align: left">45 858</td>
<td style="vertical-align: top; text-align: left">6.729%</td>
<td style="vertical-align: top; text-align: left">45 859</td>
<td style="vertical-align: top; text-align: left">6.997%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">SYN Flood</td>
<td style="vertical-align: top; text-align: left">14 436</td>
<td style="vertical-align: top; text-align: left">2.118%</td>
<td style="vertical-align: top; text-align: left">14 437</td>
<td style="vertical-align: top; text-align: left">2.203%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">HTTP Flood</td>
<td style="vertical-align: top; text-align: left">11 289</td>
<td style="vertical-align: top; text-align: left">1.656%</td>
<td style="vertical-align: top; text-align: left">11 289</td>
<td style="vertical-align: top; text-align: left">1.723%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Smurf</td>
<td style="vertical-align: top; text-align: left">9 999</td>
<td style="vertical-align: top; text-align: left">1.467%</td>
<td style="vertical-align: top; text-align: left">7 321</td>
<td style="vertical-align: top; text-align: left">1.117%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Scan</td>
<td style="vertical-align: top; text-align: left">9 999</td>
<td style="vertical-align: top; text-align: left">1.467%</td>
<td style="vertical-align: top; text-align: left">6 463</td>
<td style="vertical-align: top; text-align: left">0.986%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LAND DoS</td>
<td style="vertical-align: top; text-align: left">9 999</td>
<td style="vertical-align: top; text-align: left">1.467%</td>
<td style="vertical-align: top; text-align: left">3 116</td>
<td style="vertical-align: top; text-align: left">0.475%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.440%</td>
<td style="vertical-align: top; text-align: left">710</td>
<td style="vertical-align: top; text-align: left">0.108%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Reaper Worm</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.440%</td>
<td style="vertical-align: top; text-align: left">587</td>
<td style="vertical-align: top; text-align: left">0.090%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ICMP Flood</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.440%</td>
<td style="vertical-align: top; text-align: left">373</td>
<td style="vertical-align: top; text-align: left">0.057%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Fragmentation</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.440%</td>
<td style="vertical-align: top; text-align: left">153</td>
<td style="vertical-align: top; text-align: left">0.023%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">W32.Blaster</td>
<td style="vertical-align: top; text-align: left">2 999</td>
<td style="vertical-align: top; text-align: left">0.440%</td>
<td style="vertical-align: top; text-align: left">100</td>
<td style="vertical-align: top; text-align: left">0.015%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total:</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">681 529</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">655 363</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>After the data cleaning, the dataset was normalized with <italic>QuantileTransformer</italic>. The 40 best features from <italic>SelectKBest</italic> were obtained and further checked for feature collinearity, which was reduced using the variance inflation factor procedure (see Section <xref rid="j_infor457_s_028">3.9</xref>) with a threshold value of 40.</p>
</sec>
<sec id="j_infor457_s_034">
<label>4.4</label>
<title>Experiment Software Environment</title>
<p>All model code was implemented in Python 3.7 on Anaconda 3, using the <italic>Scikit-learn</italic><xref ref-type="fn" rid="j_infor457_fn_007">7</xref><fn id="j_infor457_fn_007"><label><sup>7</sup></label>
<p><uri>https://scikit-learn.org/stable/</uri>.</p></fn> and <italic>Imbalanced-learn</italic><xref ref-type="fn" rid="j_infor457_fn_008">8</xref><fn id="j_infor457_fn_008"><label><sup>8</sup></label>
<p><uri>https://imbalanced-learn.org/stable</uri>.</p></fn> libraries, except for the <italic>Gradient Boosting Classifier</italic>, which was implemented using the <italic>XGBoost</italic> library (Chen and Guestrin, <xref ref-type="bibr" rid="j_infor457_ref_008">2016</xref>), utilizing GPU.</p>
<p>Model parameters were searched with the <italic>GridSearchCV</italic> method. <italic>Tree depth</italic> and <italic>alpha</italic> were further validated using minimal cost-complexity pruning (Breiman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>), implemented in <italic>Scikit-learn</italic> by the <italic>cost_complexity_pruning_path</italic> function (see Section <xref rid="j_infor457_s_027">3.8</xref>).</p>
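<p>The parameter search and pruning-path validation can be sketched as follows; the toy data, the reduced grid and the cross-validation setting are illustrative assumptions, not the grids of Section 4.5:</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_informative=6, random_state=0)

# Grid search over a small CART grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["entropy", "gini"], "max_depth": range(4, 10)},
    cv=3,
).fit(X, y)

# Candidate alpha values from minimal cost-complexity pruning;
# ccp_alphas is returned in increasing order.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas
print(grid.best_params_, len(alphas))
```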
</sec>
<sec id="j_infor457_s_035">
<label>4.5</label>
<title>Parameter Values Selection</title>
<p>The following parameter ranges were selected for the grid search:</p>
<list>
<list-item id="j_infor457_li_025">
<label>1.</label>
<p>ADA: <italic>n_estimators: (range(10, 256, 5))</italic>, <italic>learning_rate: [0.001, 0.005, 0.01, 0.5, 1]</italic>, and base estimator – CART.</p>
</list-item>
<list-item id="j_infor457_li_026">
<label>2.</label>
<p>CART: <italic>criterion</italic>: (‘entropy’, ‘gini’), <italic>max_depth: range(4, 32)</italic>, <italic>min_samples_leaf: range(6, 10, 1)</italic>, <italic>max_features: [0.5, 0.6, 0.8, 1.0, ‘auto’]</italic>.</p>
</list-item>
<list-item id="j_infor457_li_027">
<label>3.</label>
<p>GBC: <italic>max_depth: range(4, 32, 1)</italic>, <italic>n_estimators: range(100, 256, 5)</italic>, other parameters used from <italic>CART</italic>.</p>
</list-item>
<list-item id="j_infor457_li_028">
<label>4.</label>
<p>KNN: <italic>n_neighbors: range(3, 16, 1)</italic>, <italic>algorithm: [‘ball_tree’, ‘auto’]</italic>, <italic>leaf_size: range(15, 35, 5)</italic>.</p>
</list-item>
<list-item id="j_infor457_li_029">
<label>5.</label>
<p>MLP: <italic>hidden_layer_sizes: tuple (32 ... 256, 32 ... 256)</italic> (<inline-formula id="j_infor457_ineq_049"><alternatives><mml:math>
<mml:mtext mathvariant="italic">step</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$\textit{step}=1$]]></tex-math></alternatives></inline-formula>), <italic>alpha: np.geomspace(1e</italic>−<italic>2, 2, 50, endpoint</italic> = <italic>True)</italic>, <italic>activation: [‘identity’, ‘logistic’, ‘tanh’, ‘relu’]</italic>, <italic>solver: [‘lbfgs’, ‘sgd’, ‘adam’]</italic>, <italic>learning_rate: [‘constant’, ‘adaptive’]</italic>, <italic>beta_1 : np.linspace(0.85, 0.95, 11, endpoint</italic> = <italic>True)</italic>, <italic>learning_rate_init: np.geomspace(2e</italic>−<italic>4, 6e</italic>−<italic>4, 5, endpoint</italic> = <italic>True)</italic>, <italic>max_iter: [200, 300]</italic>, <italic>early_stopping: [True, False]</italic>.</p>
</list-item>
<list-item id="j_infor457_li_030">
<label>6.</label>
<p>QDA: <italic>reg_param: np.geomspace(1e</italic>−<italic>19, 1e</italic>−<italic>1, 50, endpoint</italic> = <italic>True)</italic>. The value of the <italic>tol</italic> parameter only affects the threshold at which warnings about variable collinearity are raised.</p>
</list-item>
<list-item id="j_infor457_li_031">
<label>7.</label>
<p>RFC: <italic>n_estimators: range(100, 350, 5)</italic>, other parameters in the same ranges as <italic>CART</italic>.</p>
</list-item>
</list>
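<p>A few of the ranges listed above, translated into the dict-of-lists form that <italic>GridSearchCV</italic> accepts (illustrative excerpt; the keys follow the list above):</p>

```python
import numpy as np

# Excerpt of the search grids listed above, one dict per model.
param_grids = {
    "ADA":  {"n_estimators": list(range(10, 256, 5)),
             "learning_rate": [0.001, 0.005, 0.01, 0.5, 1]},
    "CART": {"criterion": ["entropy", "gini"],
             "max_depth": list(range(4, 32)),
             "min_samples_leaf": list(range(6, 10, 1)),
             "max_features": [0.5, 0.6, 0.8, 1.0, "auto"]},
    "MLP":  {"alpha": np.geomspace(1e-2, 2, 50, endpoint=True),
             "beta_1": np.linspace(0.85, 0.95, 11, endpoint=True)},
    "QDA":  {"reg_param": np.geomspace(1e-19, 1e-1, 50, endpoint=True)},
}
```

<p>Note that <italic>np.geomspace</italic> spaces the 50 candidate values for <italic>reg_param</italic> and <italic>alpha</italic> geometrically, which suits regularization strengths spanning many orders of magnitude.</p>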
<p>The parameters used in this study are presented in Table <xref rid="j_infor457_tab_015">15</xref>.</p>
<table-wrap id="j_infor457_tab_015">
<label>Table 15</label>
<caption>
<p>Model parameters used.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td colspan="3" style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Dataset</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Model</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td colspan="3" style="vertical-align: top; text-align: center">Parameters</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td colspan="3" style="vertical-align: top; text-align: left">base_estimator = DecisionTreeClassifier, learning_rate = 1<sup>1</sup>, n_estimators = 120, tree parameters as indicated for CART, next row</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">criterion = ‘entropy’,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 32,<break/>ccp_alpha = 0.00001,<break/>class_weight = ‘balanced’</td>
<td style="vertical-align: top; text-align: left">criterion = ‘entropy’,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 32,<break/>ccp_alpha = 0.00001,<break/>class_weight = ‘balanced’</td>
<td style="vertical-align: top; text-align: left">criterion = ‘entropy’,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>ccp_alpha = 0.00001,<break/>class_weight = ‘balanced’</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">n_estimators = 120,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>ccp_alpha = 0.00001,<break/>tree_method = ‘gpu_hist’</td>
<td style="vertical-align: top; text-align: left">n_estimators = 120,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>ccp_alpha = 0.00001,<break/>tree_method = ‘gpu_hist’</td>
<td style="vertical-align: top; text-align: left">n_estimators = 120,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>ccp_alpha = 0.00001,<break/>tree_method = ‘gpu_hist’</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">algorithm = ‘ball_tree’, <break/>leaf_size = 30<sup>1</sup>, <break/>metric = ‘manhattan’<break/>n_neighbors = 4,<break/>weights = ‘distance’</td>
<td style="vertical-align: top; text-align: left">algorithm = ‘ball_tree’, <break/>leaf_size = 30<sup>1</sup>, <break/>metric = ‘manhattan’,<break/>n_neighbors = 4, <break/>weights = ‘uniform’<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">algorithm = ‘ball_tree’,<break/>leaf_size = 30<sup>1</sup>, <break/>metric = ‘minkowski’<sup>1</sup>,<break/>n_neighbors = 4, p = 2<sup>1</sup>,<break/>weights = ‘uniform’<sup>1</sup></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">activation = ‘relu’<sup>1</sup>, <break/>solver = ‘adam’<sup>1</sup>,<break/>alpha = 0.0<sup>1</sup>,<break/>beta_1 = 0.9<sup>1</sup>, <break/>hidden_layer_sizes = (120, 60),<break/>learning_rate = ‘constant’<sup>1</sup>,<break/>learning_rate_init = 0.001<sup>1</sup>,<break/>early_stopping = True<sup>1</sup>,<break/>max_iter = 200<sup>1</sup>,<break/>warm_start = False<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">activation = ‘relu’<sup>1</sup>,<break/>solver = ‘adam’<sup>1</sup>,<break/>alpha = 0.067,<break/>beta_1 = 0.86,<break/>hidden_layer_sizes = (32, 46),<break/>learning_rate = ‘adaptive’,<break/>learning_rate_init = 0.00045,<break/>early_stopping = False,<break/>max_iter = 300,<break/>warm_start = True</td>
<td style="vertical-align: top; text-align: left">activation = ‘relu’<sup>1</sup>,<break/>solver = ‘adam’<sup>1</sup>,<break/>alpha = 0.0<sup>1</sup>,<break/>beta_1 = 0.9<sup>1</sup>, <break/>hidden_layer_sizes = (120, 60),<break/>learning_rate = ‘adaptive’,<break/>learning_rate_init = 0.001<sup>1</sup>,<break/>early_stopping = True<sup>1</sup>,<break/>max_iter = 200<sup>1</sup>,<break/>warm_start = True</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">priors = priors<sup>2</sup>, <break/>reg_param = 2.1e-8,<break/>tol = 0.1</td>
<td style="vertical-align: top; text-align: left">priors = priors<sup>2</sup>,<break/>reg_param = 2.3e-5,<break/>tol = 0.1</td>
<td style="vertical-align: top; text-align: left">priors = priors<sup>2</sup>,<break/>reg_param = 0.002,<break/>tol = 0.1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">RFC</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">criterion = ‘entropy’,<break/>min_samples_leaf = 7,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>n_estimators = 120, <break/>ccp_alpha = 0.0<sup>1</sup>,<break/>class_weight = ‘balanced’</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">criterion = ‘entropy’,<break/>min_samples_leaf = 7,<break/>max_features = 1.0,<break/>max_depth = 15,<break/>n_estimators = 120, <break/>ccp_alpha = 0.0<sup>1</sup>,<break/>class_weight = ‘balanced’</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">criterion = ‘entropy’,<break/>min_samples_leaf = 8,<break/>max_features = 0.5,<break/>max_depth = 15,<break/>n_estimators = 156,<break/>ccp_alpha = 0.00001,<break/>class_weight = ‘balanced’</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Default <italic>Scikit-Learn</italic> values; <sup>2</sup>Priors calculated equal to class shares.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="j_infor457_s_036">
<label>5</label>
<title>Results and Discussion</title>
<sec id="j_infor457_s_037">
<label>5.1</label>
<title>Results of the Conducted Experiments</title>
<p>Tables <xref rid="j_infor457_tab_016">16</xref>, <xref rid="j_infor457_tab_017">17</xref> and <xref rid="j_infor457_tab_018">18</xref> present the rankings of the ML methods obtained with the <italic>Standard Ranking</italic> approach (Adomavicius and Kwon, <xref ref-type="bibr" rid="j_infor457_ref_001">2011</xref>), in which equal items receive the same rank, a gap is left in the ranking after tied items, and a larger rank number indicates a worse result.</p>
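<p>A minimal sketch of the Standard Ranking rule used in these tables (the function name is ours):</p>

```python
def standard_ranking(scores, higher_is_better=True):
    """Standard (competition) ranking: rank = 1 + number of strictly better
    scores, so tied items share a rank and a gap follows the tie."""
    def better(a, b):
        return a > b if higher_is_better else a < b
    return [1 + sum(better(s, x) for s in scores) for x in scores]

# Example: scores 0.995, 0.991, 0.991, 0.984 yield ranks 1, 2, 2, 4 --
# the tied items share rank 2 and rank 3 is skipped.
ranks = standard_ranking([0.995, 0.991, 0.991, 0.984])
```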
<p>In Table <xref rid="j_infor457_tab_016">16</xref>, the results of scoring by <italic>Balanced Accuracy</italic> are in favour of trees or their ensembles, <italic>Adaboost</italic> being the strongest, closely followed by <italic>Random Forest Classifier</italic> and <italic>K-Nearest Neighbours</italic>.</p>
<table-wrap id="j_infor457_tab_016">
<label>Table 16</label>
<caption>
<p>Comparison of Model performance on 3 datasets using Balanced Accuracy Score (BAS) and Error Rate (ErR).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: middle; text-align: left; border-top: solid thin"/>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2018</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Rank by BAS</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Model</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ErR</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">BAS</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ErR</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">BAS</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ErR</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">BAS</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Best</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">ADA<sup>1</sup></td>
<td style="vertical-align: top; text-align: left">0.001</td>
<td style="vertical-align: top; text-align: left">0.995</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.060</td>
<td style="vertical-align: top; text-align: left">0.887</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0.003</td>
<td style="vertical-align: top; text-align: left">0.996</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">0.004</td>
<td style="vertical-align: top; text-align: left">0.984</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">0.064</td>
<td style="vertical-align: top; text-align: left">0.897</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">0.005</td>
<td style="vertical-align: top; text-align: left">0.985</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">12</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">0.003</td>
<td style="vertical-align: top; text-align: left">0.986</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0.063</td>
<td style="vertical-align: top; text-align: left">0.811</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0.011</td>
<td style="vertical-align: top; text-align: left">0.756</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">14</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">0.006</td>
<td style="vertical-align: top; text-align: left">0.989</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">0.060</td>
<td style="vertical-align: top; text-align: left">0.917</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.044</td>
<td style="vertical-align: top; text-align: left">0.864</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">9</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">0.020</td>
<td style="vertical-align: top; text-align: left">0.937</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0.072</td>
<td style="vertical-align: top; text-align: left">0.860</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">0.070</td>
<td style="vertical-align: top; text-align: left">0.698</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">20</td>
<td style="vertical-align: top; text-align: left">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">0.068</td>
<td style="vertical-align: top; text-align: left">0.951</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">0.090</td>
<td style="vertical-align: top; text-align: left">0.843</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0.022</td>
<td style="vertical-align: top; text-align: left">0.992</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">15</td>
<td style="vertical-align: top; text-align: left">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">RFC</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.002</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.991</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.059</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.898</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.005</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.987</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Adaboost ensemble is made of CART estimators with the grid-searched hyper-parameters described in Table <xref rid="j_infor457_tab_015">15</xref>.</p>
</table-wrap-foot>
</table-wrap>
<p>The results of this research support the notion that the <italic>Balanced Accuracy</italic> metric (see Table <xref rid="j_infor457_tab_016">16</xref>) should be used for measuring accuracy on highly and extremely imbalanced datasets. The <italic>Error Rate</italic> of all models is below 0.1, while <italic>Balanced Accuracy</italic> still reveals insufficient learning for some of them. The <italic>Accuracy</italic> of the <italic>Extremely rare</italic> (malicious) classes is dominated by the majority (benign) class, which represents over 80% of the whole data (see Tables <xref rid="j_infor457_tab_002">2</xref> and <xref rid="j_infor457_tab_003">3</xref>); the <italic>Error Rate</italic> is therefore overly optimistic, under-representing the prediction error of the <italic>Extremely rare classes</italic> (see Table <xref rid="j_infor457_tab_004">4</xref>) that are important to this research.</p>
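<p>The effect can be reproduced with a degenerate majority-class predictor on a synthetic 95%-benign sample (illustrative numbers, not from the studied datasets):</p>

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["benign"] * 950 + ["attack"] * 50   # 95% majority class
y_pred = ["benign"] * 1000                    # predict the majority everywhere

error_rate = 1 - accuracy_score(y_true, y_pred)   # 0.05 -- looks excellent
bas = balanced_accuracy_score(y_true, y_pred)     # 0.5  -- exposes the failure
```

<p>The error rate of 0.05 hides the fact that every attack record is missed, while the balanced accuracy of 0.5 (the mean of per-class recalls 1.0 and 0.0) makes the failure on the rare class visible.</p>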
<p>The ranking results in Table <xref rid="j_infor457_tab_017">17</xref> were obtained based on the minimum of the sum of rankings for <italic>Precision</italic> and <inline-formula id="j_infor457_ineq_050"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>. The results of scoring by <italic>Precision</italic> and <inline-formula id="j_infor457_ineq_051"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula> are in favour of the same tree ensembles.</p>
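<p>The multi-class G-mean used here is the geometric mean of the per-class recalls, as provided by <italic>imbalanced-learn</italic>'s <italic>geometric_mean_score</italic>; a minimal reimplementation for clarity:</p>

```python
import numpy as np
from sklearn.metrics import recall_score

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (multi-class G-mean)."""
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# Per-class recalls 2/3, 1 and 1 give g_mean = (2/3)**(1/3) ~ 0.874.
score = g_mean([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2])
```

<p>A single class with zero recall drives the whole score to zero, which is precisely why the metric suits evaluation of rare attack classes.</p>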
<table-wrap id="j_infor457_tab_017">
<label>Table 17</label>
<caption>
<p>Model rankings by Precision (Pr) and G-mean <inline-formula id="j_infor457_ineq_052"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\bar{G})$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: middle; text-align: left; border-top: solid thin"/>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2018</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Rank</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Model</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Pr</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_053"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Pr</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_054"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Pr</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_055"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Best</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">0.928</td>
<td style="vertical-align: top; text-align: left">0.919</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.991</td>
<td style="vertical-align: top; text-align: left">0.990</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.970</td>
<td style="vertical-align: top; text-align: left">0.994</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">0.868</td>
<td style="vertical-align: top; text-align: left">0.886</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">0.971</td>
<td style="vertical-align: top; text-align: left">0.977</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">0.828</td>
<td style="vertical-align: top; text-align: left">0.989</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">15</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">0.892</td>
<td style="vertical-align: top; text-align: left">0.884</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0.988</td>
<td style="vertical-align: top; text-align: left">0.987</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0.963</td>
<td style="vertical-align: top; text-align: left">0.987</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">9</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">0.906</td>
<td style="vertical-align: top; text-align: left">0.912</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0.988</td>
<td style="vertical-align: top; text-align: left">0.987</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0.674</td>
<td style="vertical-align: top; text-align: left">0.519</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">11</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">0.879</td>
<td style="vertical-align: top; text-align: left">0.834</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">0.979</td>
<td style="vertical-align: top; text-align: left">0.977</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">0.685</td>
<td style="vertical-align: top; text-align: left">0.876</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">17</td>
<td style="vertical-align: top; text-align: left">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">0.713</td>
<td style="vertical-align: top; text-align: left">0.839</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0.936</td>
<td style="vertical-align: top; text-align: left">0.881</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0.915</td>
<td style="vertical-align: top; text-align: left">0.978</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">19</td>
<td style="vertical-align: top; text-align: left">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">RFC</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.913</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.907</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.985</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.984</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.937</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.998</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The bias and variance decomposition rankings in Table <xref rid="j_infor457_tab_018">18</xref> are obtained on the basis of the minimum of the sum of bias and variance (equal to the model mean squared error when the noise component is not accounted for). The bias and variance are calculated according to formulas (<xref rid="j_infor457_eq_007">7</xref>) and (<xref rid="j_infor457_eq_008">8</xref>). To calculate the bias, we have to estimate <italic>β</italic> and <inline-formula id="j_infor457_ineq_056"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\beta }$]]></tex-math></alternatives></inline-formula>. <italic>β</italic> is the vector of true class labels of the test dataset. To estimate <inline-formula id="j_infor457_ineq_057"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\beta }$]]></tex-math></alternatives></inline-formula>, a bootstrap sample with replacement is drawn from the training dataset 5 times; each time the model is trained and its prediction is stored as a separate vector <inline-formula id="j_infor457_ineq_058"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\beta }$]]></tex-math></alternatives></inline-formula>. Then <inline-formula id="j_infor457_ineq_059"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mtext mathvariant="italic">Bias</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\textit{Bias}^{2}}$]]></tex-math></alternatives></inline-formula> is estimated as the squared length of the difference between the average prediction vector (<inline-formula id="j_infor457_ineq_060"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo fence="true" stretchy="false">[</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$E[\hat{\beta }]$]]></tex-math></alternatives></inline-formula>) and test dataset true label vector (<italic>β</italic>) and divided by the number of test records. The variance (<italic>Var</italic>) is then calculated by formula (<xref rid="j_infor457_eq_008">8</xref>), e.g. it estimates the variance in <inline-formula id="j_infor457_ineq_061"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\beta }$]]></tex-math></alternatives></inline-formula> calculated for each bootstrap sample with replacement from the training dataset.</p>
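<p>The bootstrap estimation procedure described above can be sketched as follows; the data, model and random seeds are illustrative assumptions, only the 5-resample scheme and the bias/variance formulas follow the text.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(5):
    # Bootstrap sample with replacement from the training dataset.
    idx = rng.integers(0, len(X_tr), len(X_tr))
    model = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    preds.append(model.predict(X_te))          # one prediction vector per resample
preds = np.array(preds, dtype=float)

mean_pred = preds.mean(axis=0)                 # E[beta_hat]
bias2 = np.mean((mean_pred - y_te) ** 2)       # squared bias per test record
var = np.mean((preds - mean_pred) ** 2)        # variance across the resamples
```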
<table-wrap id="j_infor457_tab_018">
<label>Table 18</label>
<caption>
<p>Model rankings using model bias and variance (Var) decomposition.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: middle; text-align: left; border-top: solid thin"/>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2017</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CIC-IDS2018</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LITNET-2020</td>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Rank<sup>1</sup></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Model</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Bias<sup>2</sup></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Var</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Bias<sup>2</sup></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Var</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Bias<sup>2</sup></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Var</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rank</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Best</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">0.09</td>
<td style="vertical-align: top; text-align: left">0.024</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1.36</td>
<td style="vertical-align: top; text-align: left">0.324</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0.22</td>
<td style="vertical-align: top; text-align: left">0.006</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">0.15</td>
<td style="vertical-align: top; text-align: left">0.109</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">1.80</td>
<td style="vertical-align: top; text-align: left">0.966</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">0.26</td>
<td style="vertical-align: top; text-align: left">0.049</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">15</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">0.08</td>
<td style="vertical-align: top; text-align: left">0.025</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1.96</td>
<td style="vertical-align: top; text-align: left">0.201</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0.22</td>
<td style="vertical-align: top; text-align: left">0.041</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">0.14</td>
<td style="vertical-align: top; text-align: left">0.050</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">2.26</td>
<td style="vertical-align: top; text-align: left">0.984</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">1.08</td>
<td style="vertical-align: top; text-align: left">0.335</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">17</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">0.16</td>
<td style="vertical-align: top; text-align: left">0.051</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">2.77</td>
<td style="vertical-align: top; text-align: left">0.477</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0.54</td>
<td style="vertical-align: top; text-align: left">0.231</td>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">17</td>
<td style="vertical-align: top; text-align: left">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">0.56</td>
<td style="vertical-align: top; text-align: left">0.018</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">19.23</td>
<td style="vertical-align: top; text-align: left">0.985</td>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left">1.12</td>
<td style="vertical-align: top; text-align: left">0.003</td>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">21</td>
<td style="vertical-align: top; text-align: left">8</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">RFC</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.11</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.034</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1.90</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.279</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.25</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.006</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Ranking is performed on the sum of model loss variance and bias squared; <sup>2</sup>Bias squared value.</p>
</table-wrap-foot>
</table-wrap>
<p>The <italic>QDA</italic> errors in Table <xref rid="j_infor457_tab_018">18</xref>, which are much higher than those of the other algorithms on the same data, are a characteristic property of models with a low number of hyper-parameters, as noted in Brownlee (<xref ref-type="bibr" rid="j_infor457_ref_005">2020</xref>). The values obtained in this experiment could be local optima, but the authors were not able to find other parameter values that would reduce the difference in values for this model between datasets. However, the bias and variance of this model were observed to be sensitive to changes in the list of features selected before the parameter search; the list of features chosen for model training is individual to each dataset.</p>
</sec>
<sec id="j_infor457_s_038">
<label>5.2</label>
<title>Discussion and Comparison of the Results</title>
<p>A comparison of results from different implementations for the CIC-IDS2017 and CSE-CIC-IDS2018 datasets is presented in Table <xref rid="j_infor457_tab_019">19</xref>. The performance metrics are not directly comparable to our research (labelled “This research” in Table <xref rid="j_infor457_tab_019">19</xref>), as validation results in our experiment were obtained using multi-class optimization and 50% of the dataset as hold-out data, versus standard k-fold cross-validation, which is known to be prone to data leakage. In our methodology, cost-sensitive model implementations provided multi-class classification measures. However, for comparison, traditional measures suitable only for balanced datasets are presented alongside the other reviewed studies (see Table <xref rid="j_infor457_tab_019">19</xref>). It is important to note that optimization in this experiment targeted the <italic>Balanced Accuracy Score</italic>; therefore, the other measures are sub-optimal.</p>
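<p>A minimal sketch of this validation setup on synthetic imbalanced data: here <italic>class_weight='balanced'</italic> stands in for the cost-sensitive model implementations, and the classifier and class proportions are illustrative assumptions, not the experiment's actual configuration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# imbalanced toy data standing in for an intrusion dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.9, 0.07, 0.03], random_state=0)

# 50% stratified hold-out validation, as in the methodology above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

# cost-sensitive fit: errors are re-weighted inversely to class frequency
clf = RandomForestClassifier(class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
score = balanced_accuracy_score(y_te, clf.predict(X_te))
```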
<table-wrap id="j_infor457_tab_019">
<label>Table 19</label>
<caption>
<p>Related research results analysis.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Algorithm</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Dataset</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Precision</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Recall</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">F1</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Source<sup>1</sup></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.77</td>
<td style="vertical-align: top; text-align: left">0.84</td>
<td style="vertical-align: top; text-align: left">0.77</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">(Kanimozhi and Jacob, <xref ref-type="bibr" rid="j_infor457_ref_028">2019a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.818</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">0.900</td>
<td style="vertical-align: top; text-align: left">(Yulianto <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_072">2019</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ADA</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.996</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ID3</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.98</td>
<td style="vertical-align: top; text-align: left">0.98</td>
<td style="vertical-align: top; text-align: left">0.98</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DT</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DT</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">(Kilincer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_031">2021</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">CART</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.995</td>
<td style="vertical-align: top; text-align: left">0.985</td>
<td style="vertical-align: top; text-align: left">0.995</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.995</td>
<td style="vertical-align: top; text-align: left">0.991</td>
<td style="vertical-align: top; text-align: left">0.993</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">0.997</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.970</td>
<td style="vertical-align: top; text-align: left">0.961</td>
<td style="vertical-align: top; text-align: left">0.965</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">GBC</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.987</td>
<td style="vertical-align: top; text-align: left">0.756</td>
<td style="vertical-align: top; text-align: left">0.987</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.96</td>
<td style="vertical-align: top; text-align: left">0.96</td>
<td style="vertical-align: top; text-align: left">0.96</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">(Kanimozhi and Jacob, <xref ref-type="bibr" rid="j_infor457_ref_028">2019a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.993</td>
<td style="vertical-align: top; text-align: left">0.985</td>
<td style="vertical-align: top; text-align: left">0.979</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.958</td>
<td style="vertical-align: top; text-align: left">0.958</td>
<td style="vertical-align: top; text-align: left">0.955</td>
<td style="vertical-align: top; text-align: left">(Kilincer <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_031">2021</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.994</td>
<td style="vertical-align: top; text-align: left">0.994</td>
<td style="vertical-align: top; text-align: left">0.994</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.989</td>
<td style="vertical-align: top; text-align: left">0.989</td>
<td style="vertical-align: top; text-align: left">0.985</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">KNN</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.957</td>
<td style="vertical-align: top; text-align: left">0.864</td>
<td style="vertical-align: top; text-align: left">0.955</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.77</td>
<td style="vertical-align: top; text-align: left">0.83</td>
<td style="vertical-align: top; text-align: left">0.76</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">(Kanimozhi and Jacob, <xref ref-type="bibr" rid="j_infor457_ref_028">2019a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.981</td>
<td style="vertical-align: top; text-align: left">0.980</td>
<td style="vertical-align: top; text-align: left">0.980</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.960</td>
<td style="vertical-align: top; text-align: left">0.959</td>
<td style="vertical-align: top; text-align: left">0.958</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MLP</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.933</td>
<td style="vertical-align: top; text-align: left">0.698</td>
<td style="vertical-align: top; text-align: left">0.929</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LSTM</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">Dutta <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_015">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">DNN</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">Dutta <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_015">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.97</td>
<td style="vertical-align: top; text-align: left">0.88</td>
<td style="vertical-align: top; text-align: left">0.92</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">LDA</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.989</td>
<td style="vertical-align: top; text-align: left">0.991</td>
<td style="vertical-align: top; text-align: left">0.990</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.966</td>
<td style="vertical-align: top; text-align: left">0.932</td>
<td style="vertical-align: top; text-align: left">0.944</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.712</td>
<td style="vertical-align: top; text-align: left">0.648</td>
<td style="vertical-align: top; text-align: left">0.597</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">QDA</td>
<td style="vertical-align: top; text-align: left">LITNET-2020</td>
<td style="vertical-align: top; text-align: left">0.980</td>
<td style="vertical-align: top; text-align: left">0.992</td>
<td style="vertical-align: top; text-align: left">0.979</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.98</td>
<td style="vertical-align: top; text-align: left">0.97</td>
<td style="vertical-align: top; text-align: left">0.97</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CIC-IDS-2017</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">(Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_057">2019</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">0.999</td>
<td style="vertical-align: top; text-align: left">(Kanimozhi and Jacob, <xref ref-type="bibr" rid="j_infor457_ref_028">2019a</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.993</td>
<td style="vertical-align: top; text-align: left">0.992</td>
<td style="vertical-align: top; text-align: left">0.993</td>
<td style="vertical-align: top; text-align: left">(Karatas <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CIC-IDS2017</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">0.998</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">RFC</td>
<td style="vertical-align: top; text-align: left">CSE-CIC-IDS2018</td>
<td style="vertical-align: top; text-align: left">0.991</td>
<td style="vertical-align: top; text-align: left">0.993</td>
<td style="vertical-align: top; text-align: left">0.992</td>
<td style="vertical-align: top; text-align: left">This research</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">RFC</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">LITNET-2020</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.996</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.997</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.996</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">This research</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>See explanatory notes related to cited work in Section <xref rid="j_infor457_s_038">5.2</xref>.</p>
</table-wrap-foot>
</table-wrap>
<p>The objective of Sharafaldin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>) was to introduce the CIC-IDS-2017 dataset; machine learning results with default model parameters are presented purely as a benchmark for future research. Feature selection was performed using a random forest regression feature selection algorithm. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_062"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> in their study were obtained as the <italic>weighted average</italic> of each evaluation metric and are presented in Table <xref rid="j_infor457_tab_019">19</xref>. Their research used Iterative Dichotomiser 3, a decision tree learner with early stopping, as implemented in Weka (Witten and Frank, <xref ref-type="bibr" rid="j_infor457_ref_070">2002</xref>). In our research, the results were obtained using the <italic>macro average</italic> for these and the other computed metrics; <italic>macro averages</italic> are more sensitive to class imbalance.</p>
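<p>The difference between the two averaging schemes can be seen in a small sketch; the toy labels below are hypothetical, not drawn from any of the datasets.</p>

```python
from sklearn.metrics import precision_recall_fscore_support

# toy imbalanced ground truth: eight benign records (0) and two attacks (1)
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two attack records is missed

# the weighted average is dominated by the majority class ...
w = precision_recall_fscore_support(y_true, y_pred, average="weighted",
                                    zero_division=0)
# ... while the macro average weights both classes equally,
# exposing the missed minority-class record
m = precision_recall_fscore_support(y_true, y_pred, average="macro",
                                    zero_division=0)
```

Here the weighted recall is 0.9, while the macro recall drops to 0.75, reflecting the 50% recall on the minority attack class.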
<p>In Sharafaldin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_057">2019</xref>), the authors improve their random forest (RFT) results by proposing super-feature creation instead of the random forest regression feature selection used in their previous research (Sharafaldin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_058">2018</xref>). In our research, feature selection was performed with the fast <italic>KBest</italic> procedure using the <italic>ANOVA F-value</italic> scoring function; this algorithm was chosen after testing three classes of feature selection methods.</p>
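<p>A minimal sketch of the <italic>KBest</italic> selection step with the <italic>ANOVA F-value</italic> scoring function, as implemented in scikit-learn; the synthetic data and the value of <italic>k</italic> are illustrative, not the per-dataset choices made in this research.</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# keep the k features with the highest ANOVA F-value between feature and label
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_sel = selector.transform(X)
```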
<p>In the strategy of Yulianto <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_072">2019</xref>), SMOTE is utilized with CIC-IDS-2017. However, only the benign and DDoS classes of the CIC-IDS-2017 dataset are taken, posing a binary classification problem, which produces results that are incomparable to ours. Features in their research are also selected differently: first using Principal Component Analysis (PCA), then Ensemble Feature Selection (EFS) with the <italic>EFS Package</italic> in <italic>R Studio</italic> and the ensemble methods <italic>gbm</italic>, <italic>glm</italic>, <italic>lasso</italic>, <italic>ridge</italic> and <italic>treebag</italic> from the <italic>fscaret</italic> library. AdaBoost classification with default weak decision tree classifiers was used during training, whereas in our research the base classifier was strengthened via pruning. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_063"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> obtained are presented in Table <xref rid="j_infor457_tab_019">19</xref>.</p>
<p>Kanimozhi and Jacob (<xref ref-type="bibr" rid="j_infor457_ref_028">2019a</xref>, <xref ref-type="bibr" rid="j_infor457_ref_029">2019b</xref>) classified the CSE-CIC-IDS2018 dataset using the ADA, RF, kNN, SVM, NB and ANN (artificial neural network) machine learning methods. For the ANN, the authors used an MLP with two layers and the <italic>lbfgs</italic> solver, grid-searching the <italic>alpha</italic> parameter (for L2 regularization) and the <italic>hidden layer</italic> sizes. Their research used binary classification: either “Benign” or “Malicious” labels were used for training, making the results not directly comparable with our multi-class approach. Results for <italic>accuracy</italic>, <italic>precision</italic>, <italic>recall</italic>, <inline-formula id="j_infor457_ineq_064"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> and <italic>AUC</italic> were obtained. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_065"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> are represented in Table <xref rid="j_infor457_tab_019">19</xref>.</p>
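The MLP set-up described above (lbfgs solver, grid-searched alpha and hidden layer sizes) can be sketched with scikit-learn; the parameter grids and data below are illustrative assumptions, not the cited authors' exact values.

```python
# Illustrative sketch: MLPClassifier with the lbfgs solver, grid-searching
# alpha (L2 regularization) and hidden layer sizes. Grids are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=500, random_state=0),
    param_grid={
        "alpha": [1e-4, 1e-2, 1.0],               # L2 regularization strength
        "hidden_layer_sizes": [(30,), (30, 30)],  # one or two hidden layers
    },
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```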
<p>In their study, Karatas <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>) classified the CSE-CIC-IDS2018 dataset using the KNN, RFT, GBC, ADA, DT (decision tree), and LDA (linear discriminant analysis with a singular value decomposition solver) algorithms. The parameters selected for all the implemented algorithms are described in Karatas <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_030">2020</xref>) Table 8. The number of classes was set to six (one non-attack class and five attack classes), which differs from our class structure and makes the results not directly comparable with ours. Cross-validation with an 80%/20% split of training and test data was used. Results for the <italic>accuracy, precision, recall</italic> and <inline-formula id="j_infor457_ineq_066"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> were obtained. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_067"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> are represented in Table <xref rid="j_infor457_tab_019">19</xref>.</p>
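An 80%/20% stratified split of training and test data, as mentioned above, can be sketched as follows; the dataset is a synthetic placeholder, and stratification preserves the per-class proportions in both partitions.

```python
# Illustrative sketch: stratified 80%/20% train/test split on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_tr), len(X_te))  # 800 200
```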
<p>In their study, Kilincer <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_031">2021</xref>) classified the CSE-CIC-IDS2018 dataset using the KNN, DT, and SVM algorithms. The Matlab options Fine KNN, Fine Tree (DT) and Quadratic SVM gave the best results in their research. A limited number of records (up to 1584 records per class, see Kilincer <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_031">2021</xref>) Table 3) was used for the CSE-CIC-IDS2018 dataset classes. The authors focus on the UNSW-NB15 dataset, with no discussion of pre-processing for CSE-CIC-IDS2018, parameter search, tree pruning or overfitting. Results for the <italic>accuracy, precision, recall</italic>, <inline-formula id="j_infor457_ineq_068"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> and <italic>g-mean</italic> were obtained. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_069"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> are represented in Table <xref rid="j_infor457_tab_019">19</xref>.</p>
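Capping each class at a fixed number of records, as in the per-class limit described above, can be sketched with pandas; the DataFrame and the cap value below are illustrative stand-ins, not the cited study's data.

```python
# Illustrative sketch: keep at most CAP records per class label.
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 6 + ["b"] * 4,
})

CAP = 3  # stand-in for a per-class record limit such as 1584
capped = df.groupby("label").head(CAP)  # first CAP rows of each class
print(capped["label"].value_counts().to_dict())  # {'a': 3, 'b': 3}
```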
<p>In Dutta <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_015">2020</xref>) the authors used SMOTE and ENN to balance the LITNET-2020 dataset. The classes are reduced to two, <italic>normal</italic> and <italic>malignant</italic>; therefore, the results are not directly comparable with ours. The approach also differs in that the authors reduce dimensionality with a deep sparse autoencoder (Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_073">2018</xref>), selecting 15 features. The authors then stack an LSTM with the <italic>adam</italic> optimizer and a four-layer DNN with back-propagation, stochastic gradient descent as the optimizer, and early stopping, on Keras with a TF back-end and Scikit-learn. 5-fold validation was used in that research. Results for the precision, recall, false positive rate, and MCC were obtained. The results of <italic>Precision</italic>, <italic>Recall</italic> and <inline-formula id="j_infor457_ineq_070"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${F_{1}}$]]></tex-math></alternatives></inline-formula> are represented in Table <xref rid="j_infor457_tab_019">19</xref>.</p>
</sec>
<sec id="j_infor457_s_039">
<label>5.3</label>
<title>Known Limitations</title>
<p>Regarding the limitations of the approach taken in this research, it is important to note that in reality new categories of malicious traffic are introduced daily. Therefore, models tuned using this method will not detect zero-day threats.</p>
<p>Another known limitation is that in the <italic>absolute rarity</italic> case, or when data has not been obtained and labelled sufficiently, models will predict with a high <italic>Error rate</italic>. A possible known solution to this problem is anomaly detection for the unseen data.</p>
<p>Moreover, the CIC-IDS2017 and IDS-2018 datasets lack some categorical flag data, which it is possible to obtain, as has been demonstrated in the LITNET-2020 case.</p>
<p>Even though LITNET-2020 lacks the temporal features introduced in the CIC-IDS datasets, this can be resolved by running CICFlowMeter on the original PCAP files.</p>
<p>The temporal averaging of flags does not help some classes, such as <italic>Infiltration</italic>; however, flag features could be added to the CIC-IDS datasets in the future.</p>
<p>While SMOTE was helpful for some rare classes, the method did not help much where sub-classes overlap due to a lack of host data or feature latency.</p>
<p>Some features can be extracted and supplemented, which might be used in future research; however, extraction requires a high degree of prior network traffic logging, and the authors are aware that organizations lack the resources to collect data at such a level of detail.</p>
</sec>
<sec id="j_infor457_s_040">
<label>5.4</label>
<title>Observations on Multi-Class Predictions</title>
<p>A detailed comparison of each class and dataset before and after SMOTE up-sampling is not presented here due to the substantial number of tables required. However, it is important to note that some rare classes in these datasets learn very well even with a small number of records, which is confirmed by testing on dedicated unseen data. Some classes learn significantly better after adding synthetic data, which is further supported by tests of model performance and classification reports executed before (prefixed with <italic>n</italic> as <inline-formula id="j_infor457_ineq_071"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi></mml:math><tex-math><![CDATA[$nPr$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor457_ineq_072"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$n\bar{G}$]]></tex-math></alternatives></inline-formula> for no-SMOTE) and after enriching the data using the SMOTE procedure in Table <xref rid="j_infor457_tab_020">20</xref>, prefixed with <italic>s</italic> as <italic>sPr</italic> and <inline-formula id="j_infor457_ineq_073"><alternatives><mml:math>
<mml:mi mathvariant="italic">s</mml:mi><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$s\bar{G}$]]></tex-math></alternatives></inline-formula>.</p>
<table-wrap id="j_infor457_tab_020">
<label>Table 20</label>
<caption>
<p>MLP model results for Precision <inline-formula id="j_infor457_ineq_074"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(Pr)$]]></tex-math></alternatives></inline-formula>, and G-mean <inline-formula id="j_infor457_ineq_075"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\bar{G})$]]></tex-math></alternatives></inline-formula> on LITNET-2020 dataset before and after SMOTE.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class<sup>1</sup></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_076"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi></mml:math><tex-math><![CDATA[$nPr$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_077"><alternatives><mml:math>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi></mml:math><tex-math><![CDATA[$sPr$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_078"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$n\bar{G}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_infor457_ineq_079"><alternatives><mml:math>
<mml:mi mathvariant="italic">s</mml:mi><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$s\bar{G}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Reaper Worm</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0.778</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0.972</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Spam Botnet</td>
<td style="vertical-align: top; text-align: left">0.631</td>
<td style="vertical-align: top; text-align: left">0.912</td>
<td style="vertical-align: top; text-align: left">0.766</td>
<td style="vertical-align: top; text-align: left">0.988</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">W32.Blaster</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.285</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.969</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><sup>1</sup>Selected example rare classes.</p>
</table-wrap-foot>
</table-wrap>
<p>As demonstrated in Table <xref rid="j_infor457_tab_020">20</xref>, random under-sampling and SMOTE over-sampling help ensure that extremely under-represented classes (see Table <xref rid="j_infor457_tab_004">4</xref>) learn with non-zero precision and <inline-formula id="j_infor457_ineq_080"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, or provide better results.</p>
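The geometric mean of per-class recalls reported in Table 20 can be computed as sketched below; the predictions here are a small illustrative example, not the study's actual model outputs.

```python
# Illustrative sketch: geometric mean of per-class recalls (G-bar).
import numpy as np
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]

recalls = recall_score(y_true, y_pred, average=None)  # one recall per class
g_mean = float(np.prod(recalls) ** (1.0 / len(recalls)))
print(recalls, g_mean)
```

A single zero-recall class drives the geometric mean to zero, which is why it exposes extremely under-represented classes that plain accuracy hides.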
</sec>
</sec>
<sec id="j_infor457_s_041">
<label>6</label>
<title>Conclusions</title>
<p>In this paper, we have studied three highly imbalanced network intrusion datasets and proposed methodology steps (see Section <xref rid="j_infor457_s_030">4</xref>) that help achieve high classification performance on rare classes, validated through model error decomposition and a 50% data hold-out strategy. The methodology was verified on a novel, differently structured dataset, LITNET-2020, and the results were compared with those obtained on the established benchmark datasets CIC-IDS2017 and CSE-CIC-IDS2018.</p>
<p>A review of the LITNET-2020 dataset's compliance with the criteria raised by Gharib <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor457_ref_023">2016</xref>) is first introduced in Section <xref rid="j_infor457_s_005">2.2</xref>. A variant of random under-sampling (skewed-ratio under-sampling, proposed by the authors and discussed in Section <xref rid="j_infor457_s_012">3.1</xref>) is used to reduce the class imbalance in a nonlinear fashion, and SMOTE-NC up-sampling (see Section <xref rid="j_infor457_s_013">3.2</xref>) is executed to increase the representation of under-represented classes. Further in this research, a comparison of the multi-class classification performance on the CIC-IDS2017 and CIC-IDS2018 datasets with the recent LITNET-2020 dataset is discussed in Section <xref rid="j_infor457_s_036">5</xref>. As LITNET-2020 is constructed differently from the CIC-IDS datasets, it can be concluded that the proposed method is resistant to dataset change. Performance metrics better suited for multi-class classification – balanced accuracy (Formula (<xref rid="j_infor457_eq_002">2</xref>)) and the geometric mean of recall (Formula (<xref rid="j_infor457_eq_004">4</xref>)) – applied to the LITNET-2020 dataset are another introduced novelty (see results in Tables <xref rid="j_infor457_tab_016">16</xref> and <xref rid="j_infor457_tab_017">17</xref>), not discussed by other authors using these datasets. Multi-criteria scoring is cross-validated by testing the models on previously unseen data (see Section <xref rid="j_infor457_s_030">4</xref>). An additional ML model, the Gradient Boosting Classifier, utilizing an ensemble of classification and regression trees, was introduced as a benchmark in this research via the <italic>XGBoost</italic> library (Chen and Guestrin, <xref ref-type="bibr" rid="j_infor457_ref_008">2016</xref>) implementation with GPU support (see Section <xref rid="j_infor457_s_022">3.5.6</xref>).
In our methodology, cost-sensitive model implementations have been used and have provided better results in some cases (see Table <xref rid="j_infor457_tab_019">19</xref>) compared with other reviewed studies. Furthermore, the selection of models with better generalization capabilities has been achieved through the decomposition of the classification error into bias and variance (see results in Table <xref rid="j_infor457_tab_018">18</xref>). Instead of using weak <italic>CART</italic> base classifiers (see Section <xref rid="j_infor457_s_027">3.8</xref>), the parameters were tuned with <italic>GridSearch</italic>, and the <italic>Tree depth</italic> and <italic>alpha</italic> parameters were validated using the method of maximum cost path analysis (Breiman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor457_ref_004">1984</xref>). Other models were tuned using <italic>GridSearch</italic> with <italic>Balanced Accuracy Score</italic> as the optimization goal.</p>
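Strengthening a CART base classifier via pruning, as described above, can be sketched with scikit-learn's minimal cost-complexity pruning (the library's implementation of the Breiman et al. cost path): `cost_complexity_pruning_path` enumerates candidate alpha values, and a grid search over tree depth and alpha picks a pruned tree. The data and grids below are illustrative assumptions, not the study's exact settings.

```python
# Illustrative sketch: grid-searching tree depth and ccp_alpha along the
# minimal cost-complexity pruning path of a CART classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=0)

# Candidate alphas along the cost-complexity path of the unpruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha that prunes to a single node

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [4, 8, None], "ccp_alpha": list(alphas[::5])},
    scoring="balanced_accuracy",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```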
<p>Machine learning algorithm rankings based on <italic>Precision</italic>, <italic>Balanced Accuracy Score</italic>, <inline-formula id="j_infor457_ineq_081"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">G</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{G}$]]></tex-math></alternatives></inline-formula>, and <italic>Bias – Variance decomposition of Error</italic>, show that tree ensembles (<italic>Adaboost</italic>, <italic>Random Forest Trees</italic> and <italic>Gradient Boosting Classifier</italic>) perform best on the network intrusion datasets compared here, including the recent LITNET-2020.</p>
</sec>
</body>
<back>
<ref-list id="j_infor457_reflist_001">
<title>References</title>
<ref id="j_infor457_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Adomavicius</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Kwon</surname>, <given-names>Y.</given-names></string-name> (<year>2011</year>). <article-title>Improving aggregate recommendation diversity using ranking-based techniques</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>24</volume>(<issue>5</issue>), <fpage>896</fpage>–<lpage>911</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_002">
<mixed-citation publication-type="other"><string-name><surname>Batista</surname>, <given-names>G.E.A.P.A.</given-names></string-name>, <string-name><surname>Prati</surname>, <given-names>R.C.</given-names></string-name>, <string-name><surname>Monard</surname>, <given-names>M.C.</given-names></string-name> (2004). A study of the behavior of several methods for balancing machine learning training data. <italic>ACM SIGKDD Explorations Newsletter</italic>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/1007730.1007735" xlink:type="simple">https://doi.org/10.1145/1007730.1007735</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name> (<year>2001</year>). <article-title>Random forests</article-title>. <source>Machine Learning</source>, <volume>45</volume>, <fpage>5</fpage>–<lpage>32</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1023/A:1010933404324" xlink:type="simple">https://doi.org/10.1023/A:1010933404324</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_004">
<mixed-citation publication-type="book"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Stone</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Olshen</surname>, <given-names>R.</given-names></string-name> (<year>1984</year>). <source>Classification and Regression Trees (Wadsworth Statistics/Probability)</source>, <isbn>0412048418</isbn>. <publisher-name>CRC Press</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_005">
<mixed-citation publication-type="book"><string-name><surname>Brownlee</surname>, <given-names>J.</given-names></string-name> (<year>2020</year>). <source>Imbalanced Classification with Python – Choose Better Metrics, Balance Skewed Classes, and Apply Cost-Sensitive Learning</source>. <publisher-name>Machine Learning Mastery</publisher-name>, <publisher-loc>San Juan</publisher-loc>, pp. <fpage>463</fpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Buczak</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Guven</surname>, <given-names>E.</given-names></string-name> (<year>2016</year>). <article-title>A survey of data mining and machine learning methods for cyber security intrusion detection</article-title>. <source>IEEE Communications Surveys &amp; Tutorials</source>, <volume>18</volume>, <fpage>1153</fpage>–<lpage>1176</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/COMST.2015.2494502" xlink:type="simple">https://doi.org/10.1109/COMST.2015.2494502</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Chawla</surname>, <given-names>N.V.</given-names></string-name>, <string-name><surname>Bowyer</surname>, <given-names>K.W.</given-names></string-name>, <string-name><surname>Hall</surname>, <given-names>L.O.</given-names></string-name>, <string-name><surname>Kegelmeyer</surname>, <given-names>W.P.</given-names></string-name> (<year>2002</year>). <article-title>SMOTE: synthetic minority over-sampling technique</article-title>. <source>Journal of Artificial Intelligence Research</source>, <volume>16</volume>, <fpage>321</fpage>–<lpage>357</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1613/jair.953" xlink:type="simple">https://doi.org/10.1613/jair.953</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_008">
<mixed-citation publication-type="chapter"><string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name> (<year>2016</year>). <chapter-title>XGBoost: a scalable tree boosting system</chapter-title>. In: <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘16</source>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York, NY, USA</publisher-loc>, pp. <fpage>785</fpage>–<lpage>794</lpage>. <isbn>978-1-4503-4232-2</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2939672.2939785" xlink:type="simple">https://doi.org/10.1145/2939672.2939785</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Chicco</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Jurman</surname>, <given-names>G.</given-names></string-name> (<year>2020</year>). <article-title>The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation</article-title>. <source>BMC Genomics</source>, <volume>21</volume>(<issue>1</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1186/s12864-019-6413-7" xlink:type="simple">https://doi.org/10.1186/s12864-019-6413-7</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_010">
<mixed-citation publication-type="other"><string-name><surname>Claise</surname>, <given-names>B.</given-names></string-name> (2004). <italic>RFC 3954, Cisco Systems NetFlow Services Export Version 9</italic>. Technical report, IETF. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.17487/rfc3954" xlink:type="simple">https://doi.org/10.17487/rfc3954</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_011">
<mixed-citation publication-type="journal"><string-name><surname>Damasevicius</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Venckauskas</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Grigaliunas</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Toldinas</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Morkevicius</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Aleliunas</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Smuikys</surname>, <given-names>P.</given-names></string-name> (<year>2020</year>). <article-title>Litnet-2020: An annotated real-world network flow dataset for network intrusion detection</article-title>. <source>Electronics (Switzerland)</source>, <volume>9</volume>(<issue>5</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/electronics9050800" xlink:type="simple">https://doi.org/10.3390/electronics9050800</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_012">
<mixed-citation publication-type="chapter"><string-name><surname>Domingos</surname>, <given-names>P.</given-names></string-name> (<year>2000</year>). <chapter-title>A unified bias-variance decomposition and its applications</chapter-title>. In: <source>Icml</source>, pp. <fpage>231</fpage>–<lpage>238</lpage>. <isbn>2065432969</isbn>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_013">
<mixed-citation publication-type="other"><string-name><surname>Draper-Gil</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Lashkari</surname>, <given-names>A.H.</given-names></string-name>, <string-name><surname>Mamun</surname>, <given-names>M.S.I.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (2016). Characterization of encrypted and VPN traffic using time-related features. In: <italic>Proceedings of the 2nd International Conference on Information Systems Security and Privacy</italic>, pp. 407–414. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5220/0005740704070414" xlink:type="simple">https://doi.org/10.5220/0005740704070414</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_014">
<mixed-citation publication-type="other"><string-name><surname>Dudani</surname>, <given-names>S.A.</given-names></string-name> (1976). The distance-weighted k-nearest-neighbor rule. <italic>IEEE Transactions on Systems, Man and Cybernetics</italic>, pp. 325–327. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSMC.1976.5408784" xlink:type="simple">https://doi.org/10.1109/TSMC.1976.5408784</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_015">
<mixed-citation publication-type="journal"><string-name><surname>Dutta</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Choraś</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Pawlicki</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kozik</surname>, <given-names>R.</given-names></string-name> (<year>2020</year>). <article-title>A deep learning ensemble for network anomaly and cyber-attack detection</article-title>. <source>Sensors (Switzerland)</source>, <volume>20</volume>(<issue>16</issue>), <fpage>1</fpage>–<lpage>20</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/s20164583" xlink:type="simple">https://doi.org/10.3390/s20164583</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_016">
<mixed-citation publication-type="journal"><string-name><surname>Ferri</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Hernández-Orallo</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Modroiu</surname>, <given-names>R.</given-names></string-name> (<year>2009</year>). <article-title>An experimental comparison of performance measures for classification</article-title>. <source>Pattern Recognition Letters</source>, <volume>30</volume>(<issue>1</issue>), <fpage>27</fpage>–<lpage>38</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.patrec.2008.08.010" xlink:type="simple">https://doi.org/10.1016/j.patrec.2008.08.010</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Fisher</surname>, <given-names>R.</given-names></string-name> (<year>1954</year>). <article-title>The analysis of variance with various binomial transformations</article-title>. <source>Biometrics</source>, <volume>10</volume>(<issue>1</issue>), <fpage>130</fpage>–<lpage>139</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>Freund</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Schapire</surname>, <given-names>R.E.</given-names></string-name> (<year>1997</year>). <article-title>A decision-theoretic generalization of on-line learning and an application to boosting</article-title>. <source>Journal of Computer and System Sciences</source>, <volume>55</volume>, <fpage>119</fpage>–<lpage>139</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1006/jcss.1997.1504" xlink:type="simple">https://doi.org/10.1006/jcss.1997.1504</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>J.H.</given-names></string-name> (<year>2001</year>). <article-title>Greedy function approximation: a gradient boosting machine</article-title>. <source>Annals of Statistics</source>, <volume>29</volume>(<issue>5</issue>), <fpage>1189</fpage>–<lpage>1232</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/aos/1013203451" xlink:type="simple">https://doi.org/10.1214/aos/1013203451</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>J.H.</given-names></string-name> (<year>2002</year>). <article-title>Stochastic gradient boosting</article-title>. <source>Computational Statistics and Data Analysis</source>, <volume>38</volume>(<issue>4</issue>), <fpage>367</fpage>–<lpage>378</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/S0167-9473(01)00065-2" xlink:type="simple">https://doi.org/10.1016/S0167-9473(01)00065-2</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_021">
<mixed-citation publication-type="chapter"><string-name><surname>Garcia</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Mollineda</surname>, <given-names>R.A.</given-names></string-name>, <string-name><surname>Sanchez</surname>, <given-names>J.S.</given-names></string-name> (<year>2010</year>). <chapter-title>Theoretical analysis of a performance measure for imbalanced data</chapter-title>. In: <source>2010 20th International Conference on Pattern Recognition</source>. <publisher-name>IEEE</publisher-name>, <publisher-loc>Istanbul</publisher-loc>, pp. <fpage>617</fpage>–<lpage>620</lpage>. <isbn>978-1-4244-7542-1</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ICPR.2010.156" xlink:type="simple">https://doi.org/10.1109/ICPR.2010.156</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>Geisser</surname>, <given-names>S.</given-names></string-name> (<year>1964</year>). <article-title>Posterior odds for multivariate normal classifications</article-title>. <source>Journal of the Royal Statistical Society: Series B (Methodological)</source>, <volume>26</volume>(<issue>1</issue>), <fpage>69</fpage>–<lpage>76</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Gharib</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sharafaldin</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Lashkari</surname>, <given-names>A.H.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (<year>2016</year>). <chapter-title>An evaluation framework for intrusion detection dataset</chapter-title>. In: <source>2016 International Conference on Information Science and Security (ICISS)</source>. <publisher-name>IEEE</publisher-name>, <publisher-loc>Pattaya, Thailand</publisher-loc>, pp. <fpage>1</fpage>–<lpage>6</lpage>. <isbn>978-1-5090-5493-0</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ICISSEC.2016.7885840" xlink:type="simple">https://doi.org/10.1109/ICISSEC.2016.7885840</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_024">
<mixed-citation publication-type="journal"><string-name><surname>Hart</surname>, <given-names>P.E.</given-names></string-name> (<year>1968</year>). <article-title>The condensed nearest neighbor rule (Corresp.)</article-title>. <source>IEEE Transactions on Information Theory</source>, <volume>14</volume>(<issue>3</issue>), <fpage>515</fpage>–<lpage>516</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TIT.1968.1054155" xlink:type="simple">https://doi.org/10.1109/TIT.1968.1054155</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_025">
<mixed-citation publication-type="book"><string-name><surname>He</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>Y.</given-names></string-name> (<year>2013</year>). <source>Imbalanced Learning: Foundations, Algorithms, and Applications</source>. <publisher-name>Wiley</publisher-name>, <publisher-loc>Piscataway, NJ</publisher-loc>, pp. <fpage>216</fpage>. <isbn>9781118074626</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/9781118646106" xlink:type="simple">https://doi.org/10.1002/9781118646106</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_026">
<mixed-citation publication-type="other"><string-name><surname>Hettich</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Bay</surname>, <given-names>S.D.</given-names></string-name> (1999). The UCI KDD Archive <uri>http://kdd.ics.uci.edu</uri>. University of California, Department of Information and Computer Science.</mixed-citation>
</ref>
<ref id="j_infor457_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Jurman</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Riccadonna</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Furlanello</surname>, <given-names>C.</given-names></string-name> (<year>2012</year>). <article-title>A comparison of MCC and CEN error measures in multi-class prediction</article-title>. <source>PLoS ONE</source>, <volume>7</volume>(<issue>8</issue>), <fpage>41882</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0041882" xlink:type="simple">https://doi.org/10.1371/journal.pone.0041882</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_028">
<mixed-citation publication-type="journal"><string-name><surname>Kanimozhi</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Jacob</surname>, <given-names>D.T.P.</given-names></string-name> (<year>2019</year>a). <article-title>Calibration of various optimized machine learning classifiers in network intrusion detection system on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing</article-title>. <source>International Journal of Engineering Applied Sciences and Technology</source>, <volume>04</volume>(<issue>06</issue>), <fpage>209</fpage>–<lpage>213</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.33564/IJEAST.2019.v04i06.036" xlink:type="simple">https://doi.org/10.33564/IJEAST.2019.v04i06.036</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Kanimozhi</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Jacob</surname>, <given-names>T.P.</given-names></string-name> (<year>2019</year>b). <article-title>Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing</article-title>. <source>ICT Express</source>, <volume>5</volume>(<issue>3</issue>), <fpage>211</fpage>–<lpage>214</lpage>. <isbn>9781538675953</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.icte.2019.03.003" xlink:type="simple">https://doi.org/10.1016/j.icte.2019.03.003</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_030">
<mixed-citation publication-type="journal"><string-name><surname>Karatas</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Demir</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Sahingoz</surname>, <given-names>O.K.</given-names></string-name> (<year>2020</year>). <article-title>Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset</article-title>. <source>IEEE Access</source>, <volume>8</volume>, <fpage>32150</fpage>–<lpage>32162</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ACCESS.2020.2973219" xlink:type="simple">https://doi.org/10.1109/ACCESS.2020.2973219</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Kilincer</surname>, <given-names>I.F.</given-names></string-name>, <string-name><surname>Ertam</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Sengur</surname>, <given-names>A.</given-names></string-name> (<year>2021</year>). <article-title>Machine learning methods for cyber security intrusion detection: datasets and comparative study</article-title>. <source>Computer Networks</source>, <volume>188</volume>(<issue>January</issue>), <fpage>107840</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.comnet.2021.107840" xlink:type="simple">https://doi.org/10.1016/j.comnet.2021.107840</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_032">
<mixed-citation publication-type="other"><string-name><surname>Koch</surname>, <given-names>R.</given-names></string-name> (2011). Towards next-generation intrusion detection. In: <italic>2011 3rd International Conference on Cyber Conflict</italic>, pp. 151–168.</mixed-citation>
</ref>
<ref id="j_infor457_ref_033">
<mixed-citation publication-type="chapter"><string-name><surname>Kubat</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Matwin</surname>, <given-names>S.</given-names></string-name> (<year>1997</year>). <chapter-title>Addressing the curse of imbalanced data sets: one-sided sampling</chapter-title>. In: <source>Proceedings of the Fourteenth International Conference on Machine Learning</source>, pp. <fpage>179</fpage>–<lpage>186</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/3-540-62858-4_79" xlink:type="simple">https://doi.org/10.1007/3-540-62858-4_79</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_034">
<mixed-citation publication-type="other"><string-name><surname>Kurniabudi</surname></string-name>, <string-name><surname>Stiawan</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Darmawijoyo</surname></string-name>, <string-name><surname>Bin Idris</surname>, <given-names>M.Y.B.</given-names></string-name>, <string-name><surname>Bamhdi</surname>, <given-names>A.M.</given-names></string-name>, <string-name><surname>Budiarto</surname>, <given-names>R.</given-names></string-name> (2020). CICIDS-2017 dataset feature analysis with information gain for anomaly detection. <italic>IEEE Access</italic>, 8, 132911–132921. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ACCESS.2020.3009843" xlink:type="simple">https://doi.org/10.1109/ACCESS.2020.3009843</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_035">
<mixed-citation publication-type="other"><string-name><surname>Lashkari</surname>, <given-names>A.H.</given-names></string-name>, <string-name><surname>Gil</surname>, <given-names>G.D.</given-names></string-name>, <string-name><surname>Mamun</surname>, <given-names>M.S.I.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (2017). Characterization of tor traffic using time based features. In: <italic>Proceedings of the 3rd International Conference on Information Systems Security and Privacy</italic>, pp. 253–262. 978-989-758-209-7. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5220/0006105602530262" xlink:type="simple">https://doi.org/10.5220/0006105602530262</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_036">
<mixed-citation publication-type="book"><string-name><surname>Laurikkala</surname>, <given-names>J.</given-names></string-name> (2001). <source>Improving Identification of Difficult Small Classes by Balancing Class Distribution</source>. <publisher-name>Springer</publisher-name>. <isbn>3540422943</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/3-540-48229-6_9" xlink:type="simple">https://doi.org/10.1007/3-540-48229-6_9</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_037">
<mixed-citation publication-type="journal"><string-name><surname>LaValle</surname>, <given-names>S.M.</given-names></string-name>, <string-name><surname>Branicky</surname>, <given-names>M.S.</given-names></string-name>, <string-name><surname>Lindemann</surname>, <given-names>S.R.</given-names></string-name> (<year>2004</year>). <article-title>On the relationship between classical grid search and probabilistic roadmaps</article-title>. <source>The International Journal of Robotics Research</source>, <volume>23</volume>(<issue>7–8</issue>), <fpage>673</fpage>–<lpage>692</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_038">
<mixed-citation publication-type="other"><string-name><surname>Lawrence Berkeley National Laboratory</surname></string-name> (2010). <italic>The Internet Traffic Archive</italic>. <uri>http://ita.ee.lbl.gov/index.html</uri>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_039">
<mixed-citation publication-type="journal"><string-name><surname>Lemaitre</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Nogueira</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Aridas</surname>, <given-names>C.K.</given-names></string-name> (<year>2016</year>). <article-title>Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning</article-title>. <source>Journal of Machine Learning Research</source>, <volume>18</volume>, <fpage>1</fpage>–<lpage>5</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_040">
<mixed-citation publication-type="journal"><string-name><surname>Lemaître</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Nogueira</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Aridas</surname>, <given-names>C.K.</given-names></string-name> (<year>2017</year>). <article-title>Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning</article-title>. <source>Journal of Machine Learning Research</source>, <volume>18</volume>(<issue>17</issue>), <fpage>1</fpage>–<lpage>5</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_041">
<mixed-citation publication-type="journal"><string-name><surname>Lin</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Foster</surname>, <given-names>D.P.</given-names></string-name>, <string-name><surname>Ungar</surname>, <given-names>L.H.</given-names></string-name> (<year>2011</year>). <article-title>VIF regression: a fast regression algorithm for large data</article-title>. <source>Journal of the American Statistical Association</source>, <volume>106</volume>(<issue>493</issue>), <fpage>232</fpage>–<lpage>247</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1198/jasa.2011.tm10113" xlink:type="simple">https://doi.org/10.1198/jasa.2011.tm10113</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_042">
<mixed-citation publication-type="other"><string-name><surname>Lippmann</surname>, <given-names>R.P.</given-names></string-name>, <string-name><surname>Fried</surname>, <given-names>D.J.</given-names></string-name>, <string-name><surname>Graf</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Haines</surname>, <given-names>J.W.</given-names></string-name>, <string-name><surname>Kendall</surname>, <given-names>K.R.</given-names></string-name>, <string-name><surname>McClung</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Weber</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Webster</surname>, <given-names>S.E.</given-names></string-name>, <string-name><surname>Wyschogrod</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Cunningham</surname>, <given-names>R.K.</given-names></string-name>, <string-name><surname>Zissman</surname>, <given-names>M.A.</given-names></string-name> (1999). Evaluating intrusion detection systems without attacking your friends: the 1998 DARPA intrusion detection evaluation. In: <italic>Proceedings DARPA Information Survivability Conference and Exposition, 2000. DISCEX’00</italic>, pp. 12–26. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/DISCEX.2000.821506" xlink:type="simple">https://doi.org/10.1109/DISCEX.2000.821506</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_043">
<mixed-citation publication-type="journal"><string-name><surname>Maciá-Fernández</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Camacho</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Magán-Carrión</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>García-Teodoro</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Therón</surname>, <given-names>R.</given-names></string-name> (<year>2018</year>). <article-title>UGR’16: a new dataset for the evaluation of cyclostationarity-based network IDSs</article-title>. <source>Computers and Security</source>, <volume>73</volume>, <fpage>411</fpage>–<lpage>424</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.cose.2017.11.004" xlink:type="simple">https://doi.org/10.1016/j.cose.2017.11.004</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_044">
<mixed-citation publication-type="chapter"><string-name><surname>Małowidzki</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Berezinski</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mazur</surname>, <given-names>M.</given-names></string-name> (<year>2015</year>). <chapter-title>Network intrusion detection: Half a kingdom for a good dataset</chapter-title>. In: <source>Proceedings of NATO STO SAS-139 Workshop, Portugal</source>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_045">
<mixed-citation publication-type="journal"><string-name><surname>Matthews</surname>, <given-names>B.W.</given-names></string-name> (<year>1975</year>). <article-title>Comparison of the predicted and observed secondary structure of T4 phage lysozyme</article-title>. <source>Biochimica et Biophysica Acta (BBA) – Protein Structure</source>, <volume>405</volume>(<issue>2</issue>), <fpage>442</fpage>–<lpage>451</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/0005-2795(75)90109-9" xlink:type="simple">https://doi.org/10.1016/0005-2795(75)90109-9</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_046">
<mixed-citation publication-type="other"><string-name><surname>Mosley</surname>, <given-names>L.</given-names></string-name> (2013). <italic>A balanced approach to the multi-class imbalance problem</italic>. Iowa State University, Ames, Iowa. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31274/etd-180810-3375" xlink:type="simple">https://doi.org/10.31274/etd-180810-3375</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_047">
<mixed-citation publication-type="journal"><string-name><surname>Ortigosa-Hernández</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Inza</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Lozano</surname>, <given-names>J.A.</given-names></string-name> (<year>2017</year>). <article-title>Measuring the class-imbalance extent of multi-class problems</article-title>. <source>Pattern Recognition Letters</source>, <volume>98</volume>, <fpage>32</fpage>–<lpage>38</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.patrec.2017.08.002" xlink:type="simple">https://doi.org/10.1016/j.patrec.2017.08.002</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_048">
<mixed-citation publication-type="journal"><string-name><surname>Pedregosa</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Varoquaux</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Gramfort</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Michel</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Thirion</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Grisel</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Blondel</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Prettenhofer</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Weiss</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Dubourg</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>VanderPlas</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Passos</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Cournapeau</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Brucher</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Perrot</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Duchesnay</surname>, <given-names>E.</given-names></string-name> (<year>2011</year>). <article-title>Scikit-learn: machine learning in Python</article-title>. <source>Journal of Machine Learning Research</source>, <volume>12</volume>, <fpage>2825</fpage>–<lpage>2830</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_049">
<mixed-citation publication-type="journal"><string-name><surname>Quinlan</surname>, <given-names>J.R.</given-names></string-name> (<year>1986</year>). <article-title>Induction of decision trees</article-title>. <source>Machine Learning</source>, <volume>1</volume>, <fpage>81</fpage>–<lpage>106</lpage>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_050">
<mixed-citation publication-type="other"><string-name><surname>Raschka</surname>, <given-names>S.</given-names></string-name> (2018). <italic>Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning</italic>. <uri>http://arxiv.org/abs/1811.12808</uri>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_051">
<mixed-citation publication-type="other"><string-name><surname>Ring</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wunderlich</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Grudl</surname>, <given-names>D.</given-names></string-name> (2017). <italic>Technical Report CIDDS-001 data set</italic>, pp. 1–13.</mixed-citation>
</ref>
<ref id="j_infor457_ref_052">
<mixed-citation publication-type="journal"><string-name><surname>Ring</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wunderlich</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Scheuring</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Landes</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Hotho</surname>, <given-names>A.</given-names></string-name> (<year>2019</year>). <article-title>A survey of network-based intrusion detection data sets</article-title>. <source>Computers &amp; Security</source>, <volume>86</volume>, <fpage>147</fpage>–<lpage>167</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.cose.2019.06.005" xlink:type="simple">https://doi.org/10.1016/j.cose.2019.06.005</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_053">
<mixed-citation publication-type="other"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name> (1957). <italic>The perceptron, a perceiving and recognizing automaton</italic>. Cornell Aeronautical Laboratory.</mixed-citation>
</ref>
<ref id="j_infor457_ref_054">
<mixed-citation publication-type="other"><string-name><surname>Rosenblatt</surname>, <given-names>F.</given-names></string-name> (1962). <italic>Principles of Neurodynamics; Perceptrons and the Theory of Brain Mechanisms</italic>. Spartan Books, Washington.</mixed-citation>
</ref>
<ref id="j_infor457_ref_055">
<mixed-citation publication-type="journal"><string-name><surname>Ross</surname>, <given-names>B.C.</given-names></string-name> (<year>2014</year>). <article-title>Mutual information between discrete and continuous data sets</article-title>. <source>PLoS ONE</source>, <volume>9</volume>(<issue>2</issue>), <fpage>87357</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0087357" xlink:type="simple">https://doi.org/10.1371/journal.pone.0087357</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_056">
<mixed-citation publication-type="chapter"><string-name><surname>Seabold</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Perktold</surname>, <given-names>J.</given-names></string-name> (<year>2010</year>). <chapter-title>Statsmodels: econometric and statistical modeling with python</chapter-title>. In: <source>9th Python in Science Conference</source>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_057">
<mixed-citation publication-type="chapter"><string-name><surname>Sharafaldin</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Habibi Lashkari</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (<year>2019</year>). <chapter-title>A detailed analysis of the CICIDS2017 data set</chapter-title>. In: <string-name><surname>Mori</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Furnell</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Camp</surname>, <given-names>O.</given-names></string-name> (Eds.), <source>Information Systems Security and Privacy</source>. <publisher-name>Springer International Publishing</publisher-name>, <publisher-loc>Cham</publisher-loc>, pp. <fpage>172</fpage>–<lpage>188</lpage>. <isbn>978-3-030-25109-3</isbn>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_058">
<mixed-citation publication-type="chapter"><string-name><surname>Sharafaldin</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Lashkari</surname>, <given-names>A.H.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (<year>2018</year>). <chapter-title>Toward generating a new intrusion detection dataset and intrusion traffic characterization</chapter-title>. In: <source>Proceedings of the 4th International Conference on Information Systems Security and Privacy</source>, Vol. <volume>1</volume>. <publisher-name>ICISSP</publisher-name>, <publisher-loc>Funchal, Madeira, Portugal</publisher-loc>, pp. <fpage>108</fpage>–<lpage>116</lpage>. <isbn>978-989-758-282-0</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5220/0006639801080116" xlink:type="simple">https://doi.org/10.5220/0006639801080116</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_059">
<mixed-citation publication-type="other"><string-name><surname>Shetye</surname>, <given-names>A.</given-names></string-name> (2019). <italic>Feature Selection with Sklearn and Pandas</italic>. <ext-link ext-link-type="doi" xlink:href="https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b" xlink:type="simple">https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_060">
<mixed-citation publication-type="journal"><string-name><surname>Shiravi</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Shiravi</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Tavallaee</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ghorbani</surname>, <given-names>A.A.</given-names></string-name> (<year>2012</year>). <article-title>Toward developing a systematic approach to generate benchmark datasets for intrusion detection</article-title>. <source>Computers &amp; Security</source>, <volume>31</volume>(<issue>3</issue>), <fpage>357</fpage>–<lpage>374</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/J.COSE.2011.12.012" xlink:type="simple">https://doi.org/10.1016/J.COSE.2011.12.012</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_061">
<mixed-citation publication-type="journal"><string-name><surname>Smith</surname>, <given-names>M.R.</given-names></string-name>, <string-name><surname>Martinez</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Giraud-Carrier</surname>, <given-names>C.</given-names></string-name> (<year>2014</year>). <article-title>An instance level analysis of data complexity</article-title>. <source>Machine Learning</source>, <volume>95</volume>(<issue>2</issue>), <fpage>225</fpage>–<lpage>256</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10994-013-5422-z" xlink:type="simple">https://doi.org/10.1007/s10994-013-5422-z</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_062">
<mixed-citation publication-type="journal"><string-name><surname>Sokolova</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Lapalme</surname>, <given-names>G.</given-names></string-name> (<year>2009</year>). <article-title>A systematic analysis of performance measures for classification tasks</article-title>. <source>Information Processing and Management</source>, <volume>45</volume>, <fpage>427</fpage>–<lpage>437</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ipm.2009.03.002" xlink:type="simple">https://doi.org/10.1016/j.ipm.2009.03.002</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_063">
<mixed-citation publication-type="journal"><string-name><surname>Thakkar</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lohiya</surname>, <given-names>R.</given-names></string-name> (<year>2020</year>). <article-title>A review of the advancement in intrusion detection datasets</article-title>. <source>Procedia Computer Science</source>, <volume>167</volume>(<issue>2019</issue>), <fpage>636</fpage>–<lpage>645</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.procs.2020.03.330" xlink:type="simple">https://doi.org/10.1016/j.procs.2020.03.330</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_064">
<mixed-citation publication-type="other"><string-name><surname>Tharwat</surname>, <given-names>A.</given-names></string-name> (2018). Classification assessment methods. <italic>Applied Computing and Informatics</italic>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.aci.2018.08.003" xlink:type="simple">https://doi.org/10.1016/j.aci.2018.08.003</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_065">
<mixed-citation publication-type="other"><string-name><surname>The Cooperative Association for Internet Data Analysis</surname></string-name> (2010). <italic>CAIDA – The Cooperative Association for Internet Data Analysis</italic>. <uri>http://www.caida.org/home/</uri>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_066">
<mixed-citation publication-type="other"><string-name><surname>The Shmoo Group</surname></string-name> (2011). <italic>Defcon</italic>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_067">
<mixed-citation publication-type="other"><string-name><surname>Tomek</surname>, <given-names>I.</given-names></string-name> (1976). Two modifications of CNN. <italic>IEEE Transactions on Systems, Man and Cybernetics</italic>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSMC.1976.4309452" xlink:type="simple">https://doi.org/10.1109/TSMC.1976.4309452</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_068">
<mixed-citation publication-type="journal"><string-name><surname>Wei</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Yuan</surname>, <given-names>X.J.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>Q.H.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.Q.</given-names></string-name> (<year>2010</year>). <article-title>A novel measure for evaluating classifiers</article-title>. <source>Expert Systems with Applications</source>, <volume>37</volume>(<issue>5</issue>), <fpage>3799</fpage>–<lpage>3809</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2009.11.040" xlink:type="simple">https://doi.org/10.1016/j.eswa.2009.11.040</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_069">
<mixed-citation publication-type="journal"><string-name><surname>Wilson</surname>, <given-names>D.L.</given-names></string-name> (<year>1972</year>). <article-title>Asymptotic properties of nearest neighbor rules using edited data</article-title>. <source>IEEE Transactions on Systems, Man, and Cybernetics</source>, <volume>SMC-2</volume>(<issue>3</issue>), <fpage>408</fpage>–<lpage>421</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSMC.1972.4309137" xlink:type="simple">https://doi.org/10.1109/TSMC.1972.4309137</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_070">
<mixed-citation publication-type="journal"><string-name><surname>Witten</surname>, <given-names>I.H.</given-names></string-name>, <string-name><surname>Frank</surname>, <given-names>E.</given-names></string-name> (<year>2002</year>). <article-title>Data mining: practical machine learning tools and techniques with Java implementations</article-title>. <source>ACM SIGMOD Record</source>, <volume>31</volume>(<issue>1</issue>), <fpage>76</fpage>–<lpage>77</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/507338.507355" xlink:type="simple">https://doi.org/10.1145/507338.507355</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_071">
<mixed-citation publication-type="book"><string-name><surname>Witten</surname>, <given-names>I.H.</given-names></string-name>, <string-name><surname>Frank</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Hall</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Pal</surname>, <given-names>C.J.</given-names></string-name> (<year>2005</year>). <source>Data Mining: Practical Machine Learning Tools and Techniques</source>, <edition>2</edition>nd ed. <publisher-name>Morgan Kaufmann Publishers</publisher-name>, <publisher-loc>San Francisco</publisher-loc>, pp. <fpage>558</fpage>. <isbn>0-12-088407-0</isbn>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_072">
<mixed-citation publication-type="journal"><string-name><surname>Yulianto</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sukarno</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Suwastika</surname>, <given-names>N.A.</given-names></string-name> (<year>2019</year>). <article-title>Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset</article-title>. <source>Journal of Physics: Conference Series</source>, <volume>1192</volume>(<issue>1</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1088/1742-6596/1192/1/012018" xlink:type="simple">https://doi.org/10.1088/1742-6596/1192/1/012018</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor457_ref_073">
<mixed-citation publication-type="other"><string-name><surname>Zhang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Cheng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>G.</given-names></string-name> (2018). Deep sparse autoencoder for feature extraction and diagnosis of locomotive adhesion status. <italic>Journal of Control Science and Engineering</italic>, 1–9. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1155/2018/8676387" xlink:type="simple">https://doi.org/10.1155/2018/8676387</ext-link>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
