1 Introduction
Classifying objects into classes is a common problem in many research fields. A plethora of supervised machine learning algorithms have been developed: Logistic Regression, Decision Trees, Random Forests (RF), Naive Bayes, Support Vector Machines, Neural Networks and many more. We typically evaluate these algorithms on their predictive performance by comparing predicted and true classes using measures such as accuracy, sensitivity, and specificity (Han et al., 2012). This is because we seek the optimal predictive model, that is, the model with the highest predictive performance.
Hyperparameter tuning is a crucial step in building an optimal predictive model, as most machine learning algorithms depend on carefully chosen hyperparameter values that significantly impact their performance (Bischl et al., 2023). For example, when applying RF, the researcher needs to decide on the number of trees and the number of features to sample at each split point. Additionally, other subjective decisions or inputs are required during the model-building process, such as determining the initial set of (relevant) predictors to include. Hyperparameter tuning is usually performed by selecting values that optimize a predetermined metric, such as accuracy, and this process can be automated (Feurer and Hutter, 2019). Other decisions, such as selecting relevant predictors, can be theory-driven and do not necessarily involve hyperparameters, yet can still significantly impact the model and its classification results. The greater the impact of these decisions, the more sensitive the algorithm is to hyperparameter values, and the more diverse the resulting classifications become. This paper studies this diversity when different hyperparameter values are applied. We propose a data-driven methodology for assessing the similarity between two classification algorithms, conditional on the dataset on which the algorithms are applied. This means that classifiers are evaluated through their predicted classifications. The novelty of our approach stems from:
(a) evaluating each classification algorithm not as a single representative classifier, but as a collection of classifiers built under the same algorithm with varying hyperparameter values,
(b) accounting for the diversity among the corresponding classifiers associated with each binary classification algorithm.
The core of this research is therefore the evaluation of similarity between two binary classification algorithms, taking into account the different hyperparameter values and subjective inputs related to each algorithm. Accounting for the diversity of the classifiers associated with each binary classification algorithm raises two main research questions:
(R1) How to evaluate the similarity of classifiers for a given binary classification algorithm when applying different hyperparameter values or different subjective inputs?
(R2) How to evaluate the classification similarity between two binary classification algorithms when applying different hyperparameter values or different subjective inputs for each algorithm?
In general, research question (R1) relates to the problem of evaluating similarity between $k$ binary classifiers, while research question (R2) can be generalized to the problem of comparing two sets of binary classifiers. We focus on similarity in consensus agreement, which quantifies the extent to which all classifiers unanimously predict the same label for a given instance. The formal definition of consensus agreement is presented in Section 3. We address these problems by applying asymmetric similarity indices based on the Jaccard coefficient, established in Perišić and Vanbelle (2024). Approaches for measuring the similarity of binary classifiers are presented in Section 2, while Section 3 presents the basic theoretical framework related to binary classifiers and consensus agreement, together with the methodological framework. In Section 4 we apply the proposed framework to publicly available datasets. Section 5 comprises the discussion, conclusion, and proposals for future work.
2 Comparing Binary Classification Algorithms
When measuring similarity, we can take different approaches to what similarity or dissimilarity means. These approaches can result from different definitions of similarity, different fields of application, or different goals of the analysis. Moreover, the nomenclature of similarity assessment is not uniform across fields. When evaluating the similarity of two binary classifiers, researchers have used the terms (dis)similarity, resemblance, diversity, and (dis)agreement. Thus, we are frequently working with the same concept under different names.
Evaluating similarity mostly implies quantifying it, which is often operationalized by applying a certain similarity measure. The broad application of similarity measurement has resulted in a plethora of (dis)similarity/agreement/resemblance/diversity measures (see for instance, Choi et al., 2010). Some of these measures go by different names across fields or applications (for instance, the proportion of agreement and the simple matching coefficient).
When building classification models, especially on imbalanced datasets, we are often more interested in correctly identifying 1s (e.g. presence) than 0s. In such cases, we focus on one class rather than both classes equally. This can be achieved using asymmetric similarity coefficients, the most common being the Jaccard coefficient, which has been used for measuring the agreement between two classifiers. Extensions of this coefficient to multiple classifiers include the k-adic Jaccard coefficient (Warrens, 2009) and the 2-group k-adic Jaccard coefficient (Perišić and Vanbelle, 2024).
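To make the k-adic idea concrete, the following is a minimal Python sketch. It assumes the unanimity-based form commonly associated with the k-adic Jaccard coefficient: the number of objects that all $k$ classifiers label 1, divided by the number of objects that are not unanimously labelled 0. The formal definition used in this paper is given in Section 3, so the function name `k_adic_jaccard` and the exact formula here should be read as illustrative assumptions, not as the authors' implementation.

```python
from typing import Sequence


def k_adic_jaccard(predictions: Sequence[Sequence[int]]) -> float:
    """Assumed k-adic Jaccard: unanimous 1s over objects not unanimously 0.

    predictions[j][i] is the 0/1 label that classifier j assigns to object i.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_objects = len(predictions[0])
    if any(len(p) != n_objects for p in predictions):
        raise ValueError("all classifiers must label the same objects")

    # zip(*predictions) yields, for each object, the tuple of its k labels.
    all_ones = sum(
        1 for labels in zip(*predictions) if all(v == 1 for v in labels)
    )
    all_zeros = sum(
        1 for labels in zip(*predictions) if all(v == 0 for v in labels)
    )

    denominator = n_objects - all_zeros
    if denominator == 0:
        raise ValueError("undefined: every object is unanimously labelled 0")
    return all_ones / denominator


# Toy example: three classifiers, six objects (made-up labels).
preds = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
]
print(k_adic_jaccard(preds))  # 2 unanimous 1s / 4 not-all-0 objects = 0.5
```

For $k=2$ this form reduces to the ordinary Jaccard coefficient between two label vectors.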
Many measures of the connection between two classifier outputs can be derived from the statistical literature, but there is less clarity on the subject when three or more classifiers are concerned (Kuncheva and Whitaker, 2003). Often, researchers simplify the problem of comparing $k>2$ classifiers by evaluating pairwise similarities between two classifiers at a time and then taking a weighted linear combination. Some well-known similarity coefficients, such as the simple matching and the Jaccard coefficient, have also been extended to this setting (Warrens, 2009). Measuring the similarity between two sets of classifiers is even less settled. This problem is often reduced to measuring the similarity of two classifiers by applying consensus-based methods. The consensus-based practice empirically determines a consensus category in each group of classifiers and reduces the problem to the case of two classifiers (Vanbelle and Albert, 2009). Some well-known similarity measures have been extended to two sets of classifiers, for instance Cohen’s kappa coefficient (Vanbelle and Albert, 2009), and the Jaccard coefficient and the simple matching coefficient (Perišić and Vanbelle, 2024).
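The two simplifications described above can be sketched in Python as follows. Both functions are illustrative stand-ins and not the coefficients of the cited papers: `mean_pairwise_jaccard` uses equal weights for all pairs (the weighting scheme is an assumption), and `consensus_reduced_jaccard` assumes that the consensus category of a set is "1" only when all its classifiers predict 1, which is one possible consensus rule among several.

```python
from itertools import combinations
from typing import List, Sequence


def jaccard(u: Sequence[int], v: Sequence[int]) -> float:
    """Ordinary Jaccard coefficient between two 0/1 label vectors."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    if either == 0:
        raise ValueError("undefined: neither vector contains a 1")
    return both / either


def mean_pairwise_jaccard(predictions: Sequence[Sequence[int]]) -> float:
    """Equal-weight average of the Jaccard coefficient over all pairs."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        raise ValueError("need at least two classifiers")
    return sum(jaccard(u, v) for u, v in pairs) / len(pairs)


def unanimous_presence(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Assumed consensus rule: 1 only if every classifier predicts 1."""
    return [1 if all(v == 1 for v in labels) else 0
            for labels in zip(*predictions)]


def consensus_reduced_jaccard(set_a: Sequence[Sequence[int]],
                              set_b: Sequence[Sequence[int]]) -> float:
    """Reduce each set to one consensus vector, then compare the two."""
    return jaccard(unanimous_presence(set_a), unanimous_presence(set_b))


# Toy example with made-up labels for four objects.
set_a = [[1, 1, 0, 1], [1, 0, 0, 1]]
set_b = [[1, 1, 0, 0], [1, 1, 1, 0]]
print(mean_pairwise_jaccard(set_a))             # 2/3
print(consensus_reduced_jaccard(set_a, set_b))  # 1/3
```

The reduction discards how close a set was to consensus on an object, which is one reason coefficients defined directly on two sets of classifiers are of interest.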
Evaluating similarity between binary classifiers can serve different purposes. In addition to comparing technical aspects such as the complexity and execution time of the underlying algorithm, one can focus on comparing their predictive performances or examining their agreement when classifying objects. Researchers have adopted different approaches, focusing on one or more features for comparison. For instance, Makhtar et al. (2011) argue that in order to assess the similarity of predictive models as a whole, it is necessary to assess the similarity of their individual elements (Input, Function and Output) independently. They proposed a methodology to measure classifier similarity based on datasets, model functions, and confusion matrices. This includes the Dataset Similarity Coefficient, Similarity of Predictive Model Function, and Similarity of Output, which are combined into a single overall similarity coefficient.
The agreement between two classifiers can generally be evaluated by comparing the predicted classes for each object, as well as their agreement with reference values. When these reference values are the true classes, the agreement essentially reflects the predictive performance of an algorithm. Rather than comparing the agreement between two classifiers in general, researchers often focus on comparing the agreement between predicted and true classes, which is the basis of predictive performance assessment, using performance measures such as accuracy, sensitivity or AUC (see for instance, Labatut and Cherifi, 2012), or the Matthews correlation coefficient, the Brier score, and Cohen’s kappa (Chicco et al., 2021). This is because, when building a prediction model, we aim to find the classifier with the highest predictive performance. Given a dataset, the classifier with the best performance metrics is selected. As already noted, this procedure can be viewed as assessing the agreement between predicted classes and true classes.
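As a reminder of how such performance measures follow from the agreement between predicted and true classes, here is a small Python sketch that derives them from the four confusion-matrix counts. The function names and the toy labels are hypothetical; the formulas are the standard definitions, and the code assumes both classes occur in the true and in the predicted labels so that no denominator is zero.

```python
import math
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary 0/1 labels, with 1 as positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def performance_metrics(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Dict[str, float]:
    """Standard measures; assumes no denominator below is zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
    }


# Toy example with made-up labels: tp = 3, fp = 1, tn = 3, fn = 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(performance_metrics(y_true, y_pred))
```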
Different procedures have been proposed for comparing multiple classifiers, for instance the Friedman test (Demšar, 2006) followed by tests for multiple comparisons such as the mean-ranks test, the sign test or the Wilcoxon signed-rank test (Benavoli et al., 2016). A broad range of studies has analysed classification metrics in terms of specific characteristics and properties, such as the coherency of metrics used for evaluating classifiers, consistency in binary classification, and the robustness of binary classification performance metrics. Some examples can be found in Shirdel et al. (2024).
When assessing the agreement between two classifiers in terms of agreement with true classes, we can further distinguish between correct and incorrect predicted classifications. For instance, Petrakos et al. (2001) separately assessed agreement on correct classification and agreement on incorrect classification by matching the results from individual classifiers. They organize the information as a cross-classification table and then apply the proportion of agreement (the simple matching coefficient), the proportion of specific agreement, and the kappa statistic to assess classifier agreement. Wang et al. (2023) established a framework that allows for visual comparison of two classification models and enables the interpretation of the two models’ behaviour discrepancy on different feature conditions. This framework identifies data instances with disagreed predictions from the two compared classifiers and trains a discriminator to learn from the disagreed instances.
Comparing two classifiers by assessing the similarity in their predicted classifications has gained attention in the optimization of ensemble algorithms, where it is often referred to as measuring ensemble diversity. Measuring ensemble diversity, like measuring classifier agreement or similarity, is not straightforward, and as a result there is no universally accepted formal definition of diversity, nor a universally accepted measure. A natural way to measure the diversity of classifiers is to measure how much their classification decisions differ. Again, this difference can be assessed by comparing the predicted classifications, i.e. comparing the predicted class of each object, or by examining the diversity in correct and incorrect classifications. Researchers have proposed a number of ways to measure diversity in ensembles of classifiers (e.g. Kuncheva and Whitaker, 2003; Wood et al., 2023; Tsymbal et al., 2005) and to generalize a diversity measure between two classifiers to an ensemble of more than two classifiers. For instance, when considering more than two classifiers, we can differentiate between measures averaged over pairs of classifiers and measures defined on the ensemble of classifiers as a whole (Narasimhamurthy, 2005). Tsymbal et al. (2005) considered different measures of ensemble diversity, differentiating between pairwise measures (which measure diversity in the predictions of a pair of classifiers, with the total ensemble diversity taken as the average over all classifier pairs in the ensemble) and non-pairwise measures (which measure diversity in the predictions of the whole ensemble only). In the binary classification case, they propose the simple matching coefficient (referred to as the plain disagreement and the fail/non-fail disagreement), Yule’s Q statistic, the correlation coefficient and the kappa statistic. The non-pairwise measures rely on entropy and ambiguity. Kuncheva and Whitaker (2003) studied ten statistics which can measure diversity among binary classifier outputs (in terms of a correct or incorrect vote for the class label): four averaged pairwise measures (the Q statistic, the correlation, the disagreement measure and the double fault measure) and six non-pairwise measures (the entropy of the votes, the difficulty index, the Kohavi-Wolpert variance, the interrater agreement, the generalized diversity, and the coincident failure diversity). All these pairwise measures have been proposed as measures of (dis)similarity in the numerical taxonomy literature (Sokal and Sneath, 1963). A theoretical analysis of six existing diversity measures (namely the disagreement measure, double fault measure, KW variance, inter-rater agreement, generalized diversity and measure of difficulty) and some underlying relationships between them can be found in Tang et al. (2006). Also, one of the most widely used measures of classifier agreement, Cohen’s kappa coefficient, has been used as a measure of classifier diversity (Margineantu and Dietterich, 1997; Zouari et al., 2005).
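Three of the pairwise diversity measures named above can be sketched in Python as follows. The sketch works on "oracle" outputs, i.e. vectors that record for each object whether a classifier was correct (1) or incorrect (0), and uses the usual textbook forms of the Q statistic, the disagreement measure and the double fault measure. Function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def oracle_table(correct_a: Sequence[int],
                 correct_b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Counts (n11, n10, n01, n00) of jointly correct/incorrect objects.

    n11: both correct, n10: only A correct,
    n01: only B correct, n00: both incorrect.
    """
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_a, correct_b):
        if a == 1 and b == 1:
            n11 += 1
        elif a == 1 and b == 0:
            n10 += 1
        elif a == 0 and b == 1:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def q_statistic(correct_a: Sequence[int], correct_b: Sequence[int]) -> float:
    """Yule's Q on oracle outputs; undefined if the denominator is zero."""
    n11, n10, n01, n00 = oracle_table(correct_a, correct_b)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)


def disagreement(correct_a: Sequence[int], correct_b: Sequence[int]) -> float:
    """Share of objects on which exactly one classifier is correct."""
    n11, n10, n01, n00 = oracle_table(correct_a, correct_b)
    return (n10 + n01) / (n11 + n10 + n01 + n00)


def double_fault(correct_a: Sequence[int], correct_b: Sequence[int]) -> float:
    """Share of objects misclassified by both classifiers."""
    n11, n10, n01, n00 = oracle_table(correct_a, correct_b)
    return n00 / (n11 + n10 + n01 + n00)


# Toy oracle outputs (made up): n11 = 3, n10 = 2, n01 = 1, n00 = 2.
correct_a = [1, 1, 1, 0, 0, 1, 0, 1]
correct_b = [1, 0, 1, 0, 1, 1, 0, 0]
print(q_statistic(correct_a, correct_b))   # (6 - 2) / (6 + 2) = 0.5
print(disagreement(correct_a, correct_b))  # 3 / 8 = 0.375
print(double_fault(correct_a, correct_b))  # 2 / 8 = 0.25
```

For an ensemble of more than two classifiers, the averaged pairwise versions take the mean of each measure over all classifier pairs.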
4 Application
This section presents the application of the k-adic Jaccard similarity coefficient and the 2-group k-adic Jaccard similarity coefficient for measuring similarity in consensus agreement within a binary classification algorithm and between two binary classification algorithms, respectively. A part of the calculations was conducted in Petričević (2023). We consider datasets with a binary response variable and different features used as predictors to build predictive classification models with different classification algorithms. We vary the hyperparameter values or inputs for each classification algorithm and then evaluate the similarity in consensus agreement. We employ three classification algorithms: logistic regression (LR), random forests (RF) and conditional random forests (CF). The logistic regression models were built with four different sets of predictors. Thus, the first set of binary classifiers comprises four logit-based classifiers. The second set of binary classifiers comprises random forest classifiers built under different hyperparameter values: the number of decision trees in the forest and the number of features considered by each tree when splitting a node. To build these models, we used the R package randomForest (Liaw and Wiener, 2002). The third set of binary classifiers again comprises random forest classifiers, but built under the conditional random forest framework (Strobl et al., 2007, 2008). The resulting forests are unbiased in the sense that they overcome variable-selection bias, the major weak spot of the classical approaches, which tend to prefer continuous variables and variables with many categories. Again, the models were built under different hyperparameter values: the number of decision trees in the forest and the number of features considered by each tree when splitting a node. The models were built using the R package party (Strobl et al., 2009).
We perform the calculations on a balanced dataset and a highly imbalanced dataset by employing two publicly available datasets: the heart disease (HD) dataset (Janosi et al., 1988) and the stroke prediction (SP) dataset (Kaggle, 2023). We randomly split both datasets into a train and a test dataset, where $70\%$ of the data are used for training the models. We assess the prediction performance of each classifier on the test dataset by evaluating AUC, accuracy, sensitivity, specificity, precision, and the F1 measure. We emphasize that our main goal is not to find the best prediction model, but to construct three collections of predictive models. Thus, some of the models presented may have lower performance than expected, i.e. better models could be found by fine-tuning. We next describe the results separately for each application.
4.1 Application I: The HD Dataset
The HD dataset comprises 13 features and 297 instances (after data cleaning). The goal is to build a predictive model, i.e. a binary classifier, for distinguishing presence (classified as 1) and absence (classified as 0) of heart disease. The binary response variable is approximately balanced, with $46\%$ of instances classified as 1 and $54\%$ classified as 0.
The first set of binary classifiers comprises four logistic regression models (logit-based classifiers). The multivariable logit models for the heart disease dataset were built in a stepwise fashion with a combination of forward and backward selection. The criterion for feature selection was the AIC. We evaluated four models: the full model (lr_heart1), and three stepwise models with the three lowest AIC values (lr_heart2, lr_heart3, lr_heart4). These models differ in the predictors included, as presented in Table 1. The disease is classified as present if the estimated probability exceeds the threshold of $p=0.5$.
Table 1
Predictors in logit models (+ = included, − = excluded) for the HD data.
| Predictor | lr_heart1 | lr_heart2 | lr_heart3 | lr_heart4 |
|---|---|---|---|---|
| age | + | − | − | − |
| sex | + | + | + | + |
| cp | + | + | + | + |
| trestbps | + | + | + | + |
| chol | + | + | + | + |
| fbs | + | − | − | − |
| restecg | + | − | − | − |
| thalach | + | − | + | − |
| exang | + | + | + | − |
| oldpeak | + | − | − | − |
| slope | + | + | + | + |
| ca | + | + | + | + |
| thal | + | + | + | + |
The second set of binary classifiers comprises nine random forest classifiers built under different hyperparameter values: number of decision trees in the forest $n\in \{50,200,500\}$, and the number of features considered by each tree when splitting a node $m\in \{2,4,10\}$.
The third set of binary classifiers comprises nine random forest classifiers built under the conditional random forest framework with different hyperparameter values: number of decision trees in the forest $n\in \{50,200,500\}$, and the number of features considered by each tree when splitting a node $m\in \{2,4,10\}$. The performance metrics for the three sets of classifiers are summarized in Table 2.
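The nine classifiers per forest-based algorithm arise from the full grid over the two hyperparameters. A minimal Python sketch of that enumeration follows; the dictionary keys `n_trees` and `n_features` are hypothetical names chosen for illustration (they are assumed to correspond to whatever arguments the fitting routine exposes for $n$ and $m$), and no model is fitted here.

```python
from itertools import product
from typing import Dict, List, Sequence

# Hyperparameter values for the HD data, as stated in the text.
N_TREES = (50, 200, 500)   # n: number of trees in the forest
N_FEATURES = (2, 4, 10)    # m: features considered at each split


def hyperparameter_grid(n_values: Sequence[int],
                        m_values: Sequence[int]) -> List[Dict[str, int]]:
    """Enumerate every (m, n) combination, with m as the outer loop."""
    return [
        {"n_trees": n, "n_features": m}
        for m, n in product(m_values, n_values)
    ]


grid = hyperparameter_grid(N_TREES, N_FEATURES)
print(len(grid))  # 3 x 3 = 9 configurations, one classifier each
for config in grid:
    print(config)
```

Each configuration would be passed to the model-fitting routine to obtain one member of the classifier set; the set of predicted label vectors is then the input to the similarity coefficients.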
Table 2
Model evaluation for the HD data; cv denotes the coefficient of variation of a prediction metric over a set of classifiers.
| Set | Model | Accuracy | Sensitivity | Specificity | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|
| LR | lr_heart1 | 0.82 | 0.85 | 0.79 | 0.79 | 0.82 | 0.82 |
| LR | lr_heart2 | 0.80 | 0.85 | 0.74 | 0.76 | 0.80 | 0.80 |
| LR | lr_heart3 | 0.81 | 0.85 | 0.77 | 0.77 | 0.81 | 0.81 |
| LR | lr_heart4 | 0.80 | 0.83 | 0.77 | 0.77 | 0.80 | 0.80 |
| LR | cv (%) | 1.24 | 1.28 | 2.14 | 1.68 | 1.15 | 1.22 |
| CF | $m=2$, $n=50$ | 0.82 | 0.88 | 0.77 | 0.78 | 0.82 | 0.82 |
| CF | $m=2$, $n=200$ | 0.80 | 0.85 | 0.74 | 0.76 | 0.80 | 0.80 |
| CF | $m=2$, $n=500$ | 0.80 | 0.85 | 0.74 | 0.76 | 0.80 | 0.80 |
| CF | $m=4$, $n=50$ | 0.81 | 0.85 | 0.77 | 0.77 | 0.81 | 0.81 |
| CF | $m=4$, $n=200$ | 0.82 | 0.85 | 0.79 | 0.79 | 0.82 | 0.82 |
| CF | $m=4$, $n=500$ | 0.77 | 0.85 | 0.70 | 0.72 | 0.78 | 0.77 |
| CF | $m=10$, $n=50$ | 0.78 | 0.85 | 0.72 | 0.74 | 0.79 | 0.79 |
| CF | $m=10$, $n=200$ | 0.78 | 0.85 | 0.72 | 0.74 | 0.79 | 0.79 |
| CF | $m=10$, $n=500$ | 0.78 | 0.85 | 0.72 | 0.74 | 0.79 | 0.79 |
| CF | cv (%) | 2.02 | 0.92 | 3.75 | 2.75 | 1.67 | 1.96 |
| RF | $m=2$, $n=50$ | 0.81 | 0.90 | 0.72 | 0.75 | 0.82 | 0.81 |
| RF | $m=2$, $n=200$ | 0.84 | 0.90 | 0.79 | 0.80 | 0.85 | 0.85 |
| RF | $m=2$, $n=500$ | 0.83 | 0.88 | 0.79 | 0.80 | 0.8 | 0.83 |
| RF | $m=4$, $n=50$ | 0.84 | 0.90 | 0.79 | 0.80 | 0.85 | 0.85 |
| RF | $m=4$, $n=200$ | 0.82 | 0.90 | 0.74 | 0.77 | 0.83 | 0.82 |
| RF | $m=4$, $n=500$ | 0.82 | 0.88 | 0.77 | 0.78 | 0.82 | 0.82 |
| RF | $m=10$, $n=50$ | 0.76 | 0.80 | 0.72 | 0.73 | 0.76 | 0.76 |
| RF | $m=10$, $n=200$ | 0.77 | 0.83 | 0.72 | 0.73 | 0.78 | 0.77 |
| RF | $m=10$, $n=500$ | 0.77 | 0.83 | 0.72 | 0.73 | 0.78 | 0.77 |
| RF | cv (%) | 3.79 | 4.3 | 4.12 | 3.7 | 3.79 | 3.79 |
For each algorithm, we evaluate the within-set similarity in consensus agreement on presence by calculating the k-adic Jaccard coefficient. The results are presented in Table 3, with 95% confidence intervals in brackets. The LR-based classifiers have the highest level of within-set similarity, while the RF-based classifiers have the lowest. This means that classifications predicted by LR classifiers are less affected by subjective decisions on the number of predictors than RF and CF classifications are by changes in hyperparameter values, all with respect to consensus agreement on presence. Also, we can conclude that the RF-based classifiers are more sensitive to changing hyperparameter values than the CF-based classifiers, since the within-set similarity in consensus agreement is lower for the RF-based classifiers. These results are consistent with the dispersion of the performance measures presented in Table 2. The accuracy, specificity, precision, AUC and the F1 measure have the highest dispersion level for the RF algorithm, and the lowest for the LR algorithm.
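The dispersion referred to here is the coefficient of variation (cv) reported in Table 2. A minimal Python sketch of that quantity is given below, assuming the usual definition (standard deviation divided by the mean, in percent) and the sample standard deviation; whether the sample or the population form was used for the table is not stated, so this choice is an assumption.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Return 100 * sample standard deviation / mean, in percent."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("cv is undefined for a zero mean")
    return 100 * statistics.stdev(values) / mean


# Accuracy of the four LR classifiers, as rounded in Table 2.
lr_accuracy = [0.82, 0.80, 0.81, 0.80]
print(round(coefficient_of_variation(lr_accuracy), 2))
```

Because the table entries are rounded to two decimals, a value recomputed from them (about 1.19 here) need not match the reported cv (1.24) exactly.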
Table 3
Within-set similarity for the HD data, confidence intervals presented in brackets.
| Classifier set | k-adic Jaccard |
|---|---|
| LR | 0.90 $(0.81,0.99)$ |
| CF | 0.83 $(0.72,0.95)$ |
| RF | 0.77 $(0.64,0.89)$ |
The between-set similarity in consensus agreement on presence is calculated by applying the 2-group k-adic Jaccard coefficient. Results are presented in Table 4, with 95% confidence intervals in brackets. The similarity in consensus agreement is comparable across pairs of algorithms, with slightly lower similarity between LR and CF. This means that merging the LR and CF classifier sets would preserve less of the mean within-set similarity in consensus agreement than merging LR and RF or RF and CF.
Table 4
The between-set similarity in consensus agreement for the HD data.
| Classifier set | CF | RF |
|---|---|---|
| LR | 0.87 $(0.79,0.95)$ | 0.90 $(0.83,0.97)$ |
| CF | 1 | 0.90 $(0.82,0.96)$ |
4.2 Application II: The SP Dataset
The SP dataset comprises 10 features and 4908 instances (after data cleaning). The predictive models are binary classifiers, classifying instances likely to have a stroke as 1 and those unlikely as 0. The dataset is highly imbalanced: $4\%$ of instances are classified as 1 and $96\%$ are classified as 0. We applied a combination of random undersampling and oversampling to the training data of the SP dataset, which resulted in a balanced training set.
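One way to combine random undersampling and oversampling is sketched below in Python. The resampling scheme, the target size per class and the function name are assumptions made for illustration; the text does not specify how the two steps were combined. In this sketch the majority class is sampled without replacement down to the target size, and the minority class is topped up with replacement.

```python
import random
from typing import Dict, List, Sequence, Tuple


def balance_by_resampling(rows: Sequence[object],
                          labels: Sequence[int],
                          target_per_class: int,
                          seed: int = 0) -> Tuple[List[object], List[int]]:
    """Resample each class (0 and 1) to exactly target_per_class rows."""
    rng = random.Random(seed)
    by_class: Dict[int, List[object]] = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows: List[object] = []
    out_labels: List[int] = []
    for label, members in by_class.items():
        if not members:
            raise ValueError(f"class {label} has no instances")
        if len(members) >= target_per_class:
            # Undersample: draw without replacement.
            chosen = rng.sample(members, target_per_class)
        else:
            # Oversample: keep all rows, add draws with replacement.
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels


# Toy training set with a 96/4 class split (row ids stand in for features).
rows = list(range(100))
labels = [0] * 96 + [1] * 4
new_rows, new_labels = balance_by_resampling(rows, labels, target_per_class=20)
print(new_labels.count(0), new_labels.count(1))  # 20 20
```

Resampling is applied to the training data only, so the test set keeps the original class distribution.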
Table 5
Predictors in logit models (+ = included, − = excluded) for the SP dataset.
| Predictor | lr_stroke1 | lr_stroke2 | lr_stroke3 | lr_stroke4 |
|---|---|---|---|---|
| gender | + | − | + | − |
| age | + | + | + | + |
| hypertension | + | + | + | + |
| heart_disease | + | − | − | − |
| ever_married | + | − | − | − |
| work_type | + | + | + | + |
| Residence_type | + | − | − | − |
| avg_glucose_level | + | + | + | + |
| bmi | + | + | + | − |
| smoking_status | + | + | + | + |
We build the prediction models for the SP data following a similar approach to that used for the HD data (see Section 4.1). The set of LR-based classifiers comprises the full model (lr_stroke1) and three stepwise models: the best performing stepwise model constructed with the combination of forward selection and backward elimination (lr_stroke2), the second best model obtained by the stepwise method with backward elimination (lr_stroke3), and the second best model obtained by the stepwise method with forward selection (lr_stroke4). These models differ in the predictors (features) included, as presented in Table 5. The RF and CF classifiers are also built in the same fashion as for the HD dataset (see Section 4.1). The only difference is in the values of the number of features considered by each tree when splitting a node, $m\in \{2,4,8\}$. The performance metrics for the evaluated classifiers are summarized in Table 6.
Table 6
Model evaluation for the SP data; cv denotes the coefficient of variation of a prediction metric over a set of classifiers.
| Set | Model | Accuracy | Sensitivity | Specificity | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|
| LR | lr_stroke1 | 0.74 | 0.74 | 0.75 | 0.99 | 0.85 | 0.75 |
| LR | lr_stroke2 | 0.75 | 0.75 | 0.77 | 0.99 | 0.85 | 0.76 |
| LR | lr_stroke3 | 0.74 | 0.74 | 0.74 | 0.99 | 0.85 | 0.74 |
| LR | lr_stroke4 | 0.75 | 0.75 | 0.75 | 0.99 | 0.85 | 0.75 |
| LR | cv (%) | 0.37 | 0.34 | 1.65 | 0.07 | 0.21 | 0.96 |
| CF | $m=2$, $n=50$ | 0.79 | 0.80 | 0.65 | 0.98 | 0.88 | 0.72 |
| CF | $m=2$, $n=200$ | 0.79 | 0.80 | 0.68 | 0.98 | 0.88 | 0.74 |
| CF | $m=2$, $n=500$ | 0.79 | 0.80 | 0.68 | 0.98 | 0.88 | 0.74 |
| CF | $m=4$, $n=50$ | 0.82 | 0.83 | 0.63 | 0.98 | 0.90 | 0.73 |
| CF | $m=4$, $n=200$ | 0.83 | 0.83 | 0.67 | 0.98 | 0.90 | 0.75 |
| CF | $m=4$, $n=500$ | 0.82 | 0.83 | 0.67 | 0.98 | 0.90 | 0.75 |
| CF | $m=8$, $n=50$ | 0.83 | 0.84 | 0.58 | 0.98 | 0.91 | 0.71 |
| CF | $m=8$, $n=200$ | 0.83 | 0.84 | 0.60 | 0.98 | 0.90 | 0.72 |
| CF | $m=8$, $n=500$ | 0.83 | 0.84 | 0.58 | 0.98 | 0.90 | 0.71 |
| CF | cv (%) | 1.93 | 2.11 | 6.36 | 0.17 | 1.11 | 1.91 |
| RF | $m=2$, $n=50$ | 0.86 | 0.88 | 0.40 | 0.97 | 0.92 | 0.64 |
| RF | $m=2$, $n=200$ | 0.86 | 0.88 | 0.44 | 0.98 | 0.92 | 0.66 |
| RF | $m=2$, $n=500$ | 0.86 | 0.88 | 0.46 | 0.98 | 0.92 | 0.67 |
| RF | $m=4$, $n=50$ | 0.92 | 0.95 | 0.23 | 0.97 | 0.96 | 0.59 |
| RF | $m=4$, $n=200$ | 0.92 | 0.95 | 0.26 | 0.97 | 0.96 | 0.61 |
| RF | $m=4$, $n=500$ | 0.92 | 0.95 | 0.25 | 0.97 | 0.96 | 0.60 |
| RF | $m=8$, $n=50$ | 0.91 | 0.94 | 0.19 | 0.97 | 0.95 | 0.57 |
| RF | $m=8$, $n=200$ | 0.91 | 0.94 | 0.23 | 0.97 | 0.96 | 0.58 |
| RF | $m=8$, $n=500$ | 0.91 | 0.94 | 0.23 | 0.97 | 0.95 | 0.58 |
| RF | cv (%) | 3.06 | 3.52 | 30.63 | 0.33 | 1.52 | 5.43 |
We evaluate the within-set similarity in consensus agreement on presence by calculating the k-adic Jaccard coefficient. The results are presented in Table 7. As in the case of the HD data, the LR-based classifiers have the highest level of within-set similarity, while the RF-based classifiers have the lowest. Again, this means that classifications predicted by LR-based classifiers are less affected by subjective decisions on the number of predictors than RF and CF classifications are by changes in hyperparameter values, all with respect to consensus agreement on presence. Also, as in the case of the HD dataset, we can conclude that classifications predicted by RF-based classifiers are more sensitive to changing hyperparameter values than classifications predicted by the CF-based classifiers, since the within-set similarity in consensus agreement is lower for the RF-based classifiers. Compared to the HD dataset, the RF and CF algorithms exhibit notably higher sensitivity to hyperparameter values on the SP data (as shown by the lower values of the similarity coefficient), all in terms of the within-set similarity in consensus agreement. These results are consistent with the dispersion of the performance measures presented in Table 6. All performance measures have the highest dispersion level for the RF algorithm, and the lowest for the LR algorithm.
Table 7
Within-set similarity for the SP data.
| Classifier set | k-adic Jaccard |
|---|---|
| LR | 0.92 $(0.90,0.95)$ |
| CF | 0.63 $(0.58,0.68)$ |
| RF | 0.24 $(0.18,0.29)$ |
The between-set similarity in consensus agreement on presence is calculated by applying the 2-group k-adic Jaccard coefficient, and the results are presented in Table 8, with 95% confidence intervals in brackets. The LR-based and CF-based algorithms have the highest level of similarity in consensus agreement on presence, while the LR-based and RF-based algorithms have the lowest.
Table 8
The between-set similarity in consensus agreement for the SP data.
| Classifier set | CF | RF |
|---|---|---|
| LR | 0.61 $(0.57,0.65)$ | 0.22 $(0.18,0.27)$ |
| CF | 1 | 0.36 $(0.29,0.41)$ |
5 Conclusion and Discussion
Binary classification is a fundamental task in machine learning, with broad applications across various scientific and practical fields. The vast and growing number of binary classification algorithms reflects the importance of binary classification and encourages its application. This extensive collection challenges researchers to develop methods for finding the most suitable algorithm for a particular problem and dataset of interest. This typically involves comparing the predictive performance of binary classification models.
Building an effective prediction model typically requires hyperparameter tuning, meaning that when evaluating a single binary classification algorithm with different hyperparameter values, we are effectively assessing a collection of classifiers.
This research makes a significant contribution to the comparison of binary classification algorithms. Instead of focusing on prediction performances, it compares the resulting classifications. The core of our approach lies in considering the variability introduced by different hyperparameter values for each algorithm when performing such comparisons. This means that, rather than selecting a single representative classifier for each binary classification algorithm (typically the one with the best prediction performance), we consider a set of classifiers as being a representative set of a binary classification algorithm. This approach accounts for the variability of that set.
The comparison of two binary classification algorithms is performed by applying the 2-group k-adic similarity coefficient. This approach is based on the k-adic similarity approach, or the De Moivre definition of agreement. The variability within each classifier set is quantified by calculating the k-adic similarity coefficients. Specifically, we apply the k-adic similarity coefficient to assess the within-set similarity, which quantifies the similarity of k classifiers related to a particular classification algorithm. Furthermore, when classifiers are built under a certain classification algorithm by varying hyperparameter values and subjective inputs, the k-adic similarity coefficient can be used to assess the sensitivity of an algorithm to changes in these values.
The 2-group k-adic Jaccard coefficient and the k-adic Jaccard coefficient have broad potential for application that has yet to be explored. For instance, it could be of interest to evaluate the within-set similarity of the classifier set of a classification algorithm across different hyperparameter spaces. This can be done by calculating the k-adic Jaccard coefficient for different $\mathcal{H}$, and could be used as a measure of robustness. In future research, we plan to relax the strict consensus-based approach of the k-adic similarity coefficient. While this method evaluates only unanimous agreement between classifiers, we plan to develop similarity coefficients that can measure partial agreement between two sets of classifiers.
The proposed methodology applies asymmetric similarity coefficients. However, when the goal of the analysis is to take into account both agreement on presence and agreement on absence, symmetric coefficients such as the simple matching coefficient can be used instead. A version of this coefficient that can be applied to quantify the similarity of binary classification algorithms can be found in Perišić and Vanbelle (2024).
The presented methodology has several limitations and raises important open questions. The first one relates to the selection of hyperparameter values. The hyperparameter space is multidimensional, with each dimension representing a specific hyperparameter. The same complexity applies for the space related to subjective input values. Key questions to address include:
• How many distinct hyperparameters should be chosen when forming a set of binary classifiers for some classification algorithm?
• What criteria should guide the selection of hyperparameter values?
• Is it justified to compare the sensitivity to hyperparameters when different hyperparameters or subjective inputs are considered for each algorithm?
For example, in the application described in Section 4, logistic regression was found to be less sensitive to variations in hyperparameters and inputs, while random forest was more sensitive. Logistic regression classifiers were created by selecting different predictors during model building, whereas random forests were formed by altering parameters such as the number of trees in the forest and the number of features considered at each split. This suggests that it may not always be appropriate to consider the same hyperparameter dimensions across algorithms. Thus, comparing algorithm sensitivity to changing hyperparameter values will sometimes imply evaluating different hyperparameter dimensions for each classification algorithm. However, we should be aware that results on algorithm sensitivity could be affected by selected hyperparameter dimensions.
Another limitation arises when comparing classifier sets of different sizes. If we assess the within-set similarity of a collection of classifiers derived from various hyperparameter settings for a given algorithm, increasing the number of classifiers in the set (i.e. evaluating a broader range of hyperparameter values) could increase the heterogeneity within the set. This is evident in the examples provided in Appendix A. Consequently, the selection of both the number of classifiers and the specific hyperparameter values should be approached with care, particularly if no predefined rule dictates which hyperparameters are of interest.
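The set-size effect can be illustrated with a short Python sketch. It assumes the unanimity-based form of the k-adic Jaccard coefficient (objects unanimously labelled 1 divided by objects not unanimously labelled 0), which is an assumption of this sketch and not a restatement of the formal definition in Section 3. Under that form, adding a classifier can only shrink the set of unanimous 1s and the set of unanimous 0s, so the coefficient cannot increase as the set grows. The labels below are made up.

```python
from typing import List, Sequence


def k_adic_jaccard(predictions: Sequence[Sequence[int]]) -> float:
    """Assumed form: unanimous 1s over objects not unanimously 0."""
    n_objects = len(predictions[0])
    all_ones = sum(
        1 for labels in zip(*predictions) if all(v == 1 for v in labels)
    )
    all_zeros = sum(
        1 for labels in zip(*predictions) if all(v == 0 for v in labels)
    )
    if n_objects == all_zeros:
        raise ValueError("undefined: every object is unanimously labelled 0")
    return all_ones / (n_objects - all_zeros)


# Four hypothetical classifiers labelling the same six objects.
preds = [
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
]

# Within-set similarity of the first k classifiers, for k = 2, 3, 4.
values: List[float] = [
    k_adic_jaccard(preds[:k]) for k in range(2, len(preds) + 1)
]
print(values)  # [0.75, 0.4, 0.2]
```

This is why coefficients computed on sets of different sizes should be compared with care: a lower value may partly reflect the larger number of classifiers and not only a higher sensitivity of the algorithm.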