In this section, we present experimental results to validate our proposed approaches. We evaluate standard machine learning metrics (precision, recall, specificity, negative predictive value, and precision-recall and receiver operating characteristic curves), the correctness of state classification, and the computational efficiency across diverse DTMC models. Furthermore, we compare the proposed approaches with the state-of-the-art.
5.1 Experimental Setup
We evaluate our approaches using four standard DTMC model classes: Nand, brp, egl, and crowds, as documented in Kwiatkowska et al. (2012), Hartmanns et al. (2019), and Budde et al. (2020b). The reachability properties under investigation are:
• P=?[F ! knowA & knowB] for egl cases,
• P=?[F observe0 > 1] for crowds cases,
• P=?[F s = 4 & z/N < 0.1] for Nand cases,
• P=?[F s = 5] for brp cases.
We implemented our algorithms in the PRISM model checker, constructing models with varying parameter values. For a fair comparison, we disabled the pre-computation of the ${S^{0}}$ and ${S^{1}}$ sets. To ensure the reproducibility of our results and to balance detection performance against computational overhead, we configured the DT classifiers with the following hyperparameters. Max depth: limited to values between 4 and 10 (depending on model complexity) to avoid overfitting to specific simulation paths. Criterion: we used Gini impurity to measure the quality of splits, as it is computationally efficient for large state spaces. Random state: a fixed seed of 0 was used for training to make the results reproducible across executions. All other hyperparameters were set to their default values.
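For illustration, the hyperparameter configuration above can be sketched with a scikit-learn decision tree. The library choice, the depth value of 6, and the toy feature matrix below are our assumptions for exposition only, not the actual PRISM-side implementation:

```python
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters as described above; max_depth=6 is one value from the
# reported 4-10 range (the exact depth depends on model complexity).
clf = DecisionTreeClassifier(
    max_depth=6,       # bounded depth to avoid overfitting to simulation paths
    criterion="gini",  # Gini impurity to measure split quality
    random_state=0,    # fixed seed 0 for reproducible training
)

# Toy state-feature vectors and S^0 / S^? labels, purely illustrative.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 0, 0, 1, 1, 1]  # 1 = S^0, 0 = S^?
clf.fit(X, y)
```

With a fixed `random_state`, repeated training on the same sample yields an identical tree, which is what the reproducibility requirement above demands.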
To evaluate the performance of the DT classifier, we employed a series of standard machine learning metrics. For this analysis, $10\% $ of the whole set of states was allocated for training the model, while the remaining $90\% $ was reserved for testing its predictive accuracy. Precision measures the reliability of a positive (${S^{0}}$) prediction: of all states the model labelled as ${S^{0}}$, how many were actually ${S^{0}}$? Recall (true positive rate) measures the model’s ability to find all positive instances: of all ${S^{0}}$ states that exist, how many did the model successfully find? Specificity measures the model’s ability to correctly identify negative instances: of all the non-${S^{0}}$ (${S^{?}}$) states, how many were correctly labelled as such? Negative predictive value (NPV) measures the reliability of a negative prediction: of all states the model labelled as ${S^{?}}$, how many were truly ${S^{?}}$? False positive rate (FPR) answers a different question: of all the states that were actually ${S^{?}}$, how many did the model incorrectly flag as ${S^{0}}$? Finally, we used the precision-recall and Receiver Operating Characteristic (ROC) curves to visualize the classifier’s performance across various threshold settings. By plotting recall against FPR, the ROC curve illustrates the model’s fundamental ability to distinguish between classes.
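All five metrics derive from the four confusion-matrix counts. A minimal sketch follows; the counts are hypothetical, chosen only to echo the operating point reported for the egl case study below:

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard metrics for the S^0-vs-S^? classification task.

    tp: S^0 states correctly labelled S^0    fn: S^0 states the model missed
    tn: S^? states correctly labelled S^?    fp: S^? states flagged as S^0
    """
    return {
        "precision":   tp / (tp + fp),  # reliability of an S^0 prediction
        "recall":      tp / (tp + fn),  # share of true S^0 states found
        "specificity": tn / (tn + fp),  # share of S^? states kept as S^?
        "npv":         tn / (tn + fn),  # reliability of an S^? prediction
        "fpr":         fp / (fp + tn),  # S^? states wrongly flagged as S^0
    }

# Hypothetical counts: 674 of 1000 S^0 states found, 5% of 400 S^? flagged.
m = classifier_metrics(tp=674, fp=20, tn=380, fn=326)
```

Note that precision and NPV depend on the class balance of the test set, while recall, specificity, and FPR do not; this is why a classifier can combine high precision with a modest NPV, as observed below.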
5.2 Experimental Results
To evaluate ${S^{0}}$ detection, we consider the egl model with parameters $N=5$ and $L=6$, together with Algorithm 3 (V3). In this instance of the egl model, each state comprises 84 feature values, which are used for training the DT. With a Precision of
$97.5\% $, we can deduce that when the model identifies something as
${S^{0}}$, it is almost certainly correct. This high level of certainty is backed by a Specificity of
$95.1\% $, meaning it rarely confuses the background with our target. The trade-off for this high accuracy is that the model is somewhat conservative; it only catches about
$67.4\% $ of all actual
${S^{0}}$ instances. In contrast, the model takes a wide-net approach for
${S^{?}}$. It achieves a high recall of
$95\% $, ensuring that very few actual
${S^{?}}$ instances are missed. However, this inclusiveness comes at a cost to precision, a fact further reflected by the
$50.9\% $ NPV. This tells us that while the model finds almost every
${S^{?}}$ case, about half of those predictions are actually false alarms. Ultimately, with F1-scores of
$80\% $ and
$66\% $, the data shows a classifier that is fine-tuned to be an expert at identifying the target class while serving as a sensitive, albeit less precise, detector for the rest. These findings are consistent with the discrete operating points observed in the precision-recall and ROC curves illustrated in Fig.
7.
Figure
7 illustrates a quantitative evaluation of the DT for identifying
${S^{0}}$ states by demonstrating precision-recall and ROC curves. Our evaluation reveals that the classifier is both reliable and highly discerning, achieving an average precision (AP) of 0.90 and an area under the curve (AUC) of 0.81. What stands out most is the model’s sweet spot at approximately
$68\% $ recall. At this point, the classifier maintains nearly perfect precision while keeping false alarms to a remarkably low
$5\% $. While we can push the model to catch every single positive instance (
$100\% $ recall), it does so at a reasonable cost, with precision dipping to 0.74. Throughout this range, the model remains remarkably stable, consistently holding an F1-score of about
$80\% $. Essentially, these curves describe a classifier that is good at making high-confidence predictions. These results demonstrate that the constructed DT provides good separation, suggesting that our proposed DT-based methods are promising.
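The AUC reported in the curves above can also be computed without plotting, via the Mann-Whitney formulation: the AUC equals the probability that a randomly chosen ${S^{0}}$ state receives a higher classifier score than a randomly chosen ${S^{?}}$ state. A self-contained sketch (the labels and scores below are made up for illustration):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5).

    labels: 1 for the positive class (S^0), 0 for the negative class (S^?).
    scores: classifier scores, where higher means "more likely S^0".
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: one positive is outranked by a negative, so AUC < 1.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # prints 0.75
```

An AUC of 0.5 corresponds to the random-classifier baseline shown in Fig. 7(b), and 1.0 to perfect separation.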

Fig. 7
Quantitative performance evaluation of the decision tree classifier for detecting ${S^{0}}$ in the case of model egl with $N=5$ and $L=6$. (a) Precision-recall curve showing the trade-off between precision and recall, overlaid with Iso-F1 contours. (b) ROC curve plotting the recall against the FPR. The model attains an AUC of 0.81, demonstrating discriminatory power significantly superior to the random classifier baseline.
Table 1
Correctness of the proposed approach for the selected DTMC models.
| Model name | Parameter values | $|S|$ | $|{S^{0}}|$ | Algorithm 1 Correctness | Algorithm 1 $|\approx {S^{0}}|$ | Algorithm 2 Correctness | Algorithm 2 $|\approx {S^{0}}|$ | Algorithm 3 (V1) Correctness | Algorithm 3 (V1) $|\approx {S^{0}}|$ | Algorithm 3 (V2) Correctness | Algorithm 3 (V2) $|\approx {S^{0}}|$ | Algorithm 3 (V3) Correctness | Algorithm 3 (V3) $|\approx {S^{0}}|$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| egl | $N=5$, $L=6$ | 115 710 | 85 376 | $\gt 99.9\% $ | 17 | $\gt 99.9\% $ | 85 605 | $98.4\% $ | 87 209 | $\gt 99.9\% $ | 86 919 | $98.4\% $ | 97 457 |
| egl | $N=5$, $L=8$ | 156 670 | 115 136 | $\gt 99.9\% $ | 18 | $\gt 99.9\% $ | 114 869 | $98.4\% $ | 116 039 | $\gt 99.9\% $ | 114 570 | $\gt 99.9\% $ | 112 263 |
| egl | $N=6$, $L=6$ | 552 958 | 415 456 | $\gt 99.9\% $ | 20 | $\gt 99.9\% $ | 414 580 | $74.6\% $ | 413 950 | $99.7\% $ | 423 114 | $\gt 99.9\% $ | 457 582 |
| egl | $N=6$, $L=8$ | 749 566 | 560 608 | $\gt 99.9\% $ | 22 | $\gt 99.9\% $ | 559 792 | $48.3\% $ | 671 104 | $\gt 99.9\% $ | 560 744 | $99.6\% $ | 566 450 |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=15$ | 592 060 | 472 368 | $\gt 99.9\% $ | 15 504 | $\gt 99.9\% $ | 385 988 | $91.9\% $ | 592 060 | $\gt 99.9\% $ | 9 442 | $99.8\% $ | 384 238 |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=20$ | 2 061 951 | 1 745 849 | $\gt 99.9\% $ | 53 130 | $\gt 99.9\% $ | 1 446 459 | $92.4\% $ | 2 007 425 | $99.8\% $ | 24 539 | $\gt 99.9\% $ | 220 151 |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=10$ | 253 525 | 227 271 | $\gt 99.9\% $ | 8 008 | $\gt 99.9\% $ | 180 488 | $85.5\% $ | 320 750 | $\gt 99.9\% $ | 4 433 | $\gt 99.9\% $ | 72 889 |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=15$ | 2 464 168 | 1 844 704 | $\gt 99.9\% $ | 54 264 | $\gt 99.9\% $ | 1 498 636 | $87.2\% $ | 2 365 053 | $\gt 99.9\% $ | 21 204 | $\gt 99.9\% $ | 364 698 |
| Nand | $N=20$, $K=4$ | 171 542 | 29 276 | $\gt 99.9\% $ | 21 | $\gt 99.9\% $ | 40 | $94.9\% $ | 46 098 | $99.9\% $ | 7 164 | $99.8\% $ | 41 718 |
| Nand | $N=30$, $K=2$ | 681 362 | 128 079 | $\gt 99.9\% $ | 31 | $\gt 99.9\% $ | 56 | $51.8\% $ | 490 409 | $\gt 99.9\% $ | 14 802 | $85.7\% $ | 158 627 |
| Nand | $N=30$, $K=3$ | 1 020 152 | 128 079 | $\gt 99.9\% $ | 31 | $\gt 99.9\% $ | 56 | $43.9\% $ | 588 956 | $93.9\% $ | 20 628 | $99.8\% $ | 109 731 |
| brp | $N=1024$, $\mathit{Max}=2$ | 43 013 | 4 105 | $97.4\% $ | 2 052 | $97.4\% $ | 4 120 | $99.1\% $ | 7 560 | $99.4\% $ | 126 | $99.4\% $ | 352 |
| brp | $N=2048$, $\mathit{Max}=2$ | 89 021 | 8 201 | $\gt 99.9\% $ | 12 291 | $94.8\% $ | 8 216 | $\gt 99.9\% $ | 6 144 | $\gt 99.9\% $ | 270 | $94.8\% $ | 8 205 |
Table 1 presents our experimental results, showing for each case study: the model name and parameter values, the total number of states ($|S|$), and the number of non-goal-reaching states ($|{S^{0}}|$) as reported by PRISM. Moreover, the number of approximated ${S^{0}}$ states ($|\approx {S^{0}}|$) identified by our learning methods, as well as the correctness of the computed reachability probability for the initial state, are reported. The correctness metric for reachability probabilities is defined as
$\frac{{\tilde{x}_{{s_{0}}}}}{{x_{{s_{0}}}}}$, where
${x_{{s_{0}}}}={P_{=?}}(\Psi )$ represents the actual probability and
${\tilde{x}_{{s_{0}}}}$ denotes the computed value using decision trees for
${S^{0}}$ and
${S^{?}}$ labelling. We evaluate three versions of Algorithm 3: V1, the original algorithm without ${S_{sam}^{0}}$ or ${S_{sam}^{?}}$ additions; V2, which adds randomly selected states to ${S_{sam}^{?}}$; and V3, which adds goal-reaching path states.
Algorithms
1 and
2 yield more conservative results for the first three model classes. For these cases and PCTL properties, the models contain several small BSCCs where the constructed decision trees properly detect the states within them. The main limitation of these approaches (Algorithms
1 and
2) is their tendency to overfit to the states in BSCCs, resulting in poor generalization to other parts of
${S^{0}}$. While this limitation affects the simulation run length, it maintains high precision in BSCC detection and run termination. For the last class of case studies (
brp), the computational correctness varies across models depending on the parameter values. The first two algorithms suffer from overfitting, incorrectly labelling some unseen states in
$S\setminus {S^{0}}$ as belonging to
${S^{0}}$.
The experimental results demonstrate that in most cases, balancing the sizes of
${S_{sam}^{0}}$ and
${S_{sam}^{?}}$ generally improves the computational correctness of Algorithm
3. While the first version of this algorithm (denoted as (V1) in Table 1) performs poorly, achieving correctness below 95% in nine cases, version 2 falls below 95% in only one case and version 3 in two. Notably, the main purpose of presenting the second and third versions of Algorithm 3 is to remedy the first version’s weakness of occasionally producing results that are far from the exact value.
In addition to computational correctness, we report the number of detected states for
${S^{0}}$ (denoted by
$|\approx {S^{0}}|$) for each approach. For
egl and
crowds models, the values of
$|\approx {S^{0}}|$ are close to
$|{S^{0}}|$. For the
crowds models, Algorithm
2 demonstrates the best performance in detecting a substantial portion of
${S^{0}}$ while maintaining computational correctness. For these models and other model classes, Algorithm
3 (V1) labels the widest range of states as ${S^{0}}$ and achieves the highest
$|\approx {S^{0}}|$ among all proposed approaches.
In the case of
Nand models, Algorithms
1 and
2 overfit to the training set and detect only a small portion of states in
${S^{0}}$. For these models, the first and second versions of Algorithm
3 label an excessive number of states as
${S^{0}}$, which degrades computational correctness. Similarly, in
egl models, Algorithm
1 exhibits overfitting.
Table 2
Correctness of the proposed statistical approach for bounding simulation runs with $\epsilon =0.01$.
| Model name | Parameter values | μ | σ | $ub$ | Correctness |
| --- | --- | --- | --- | --- | --- |
| egl | $N=5$, $L=6$ | 94.5 | 34.5 | 440 | $\gt 99.9\% $ |
| egl | $N=5$, $L=8$ | 123.5 | 44.5 | 568.5 | $\gt 99.9\% $ |
| egl | $N=6$, $L=6$ | 117.9 | 41.3 | 531.5 | $\gt 99.9\% $ |
| egl | $N=6$, $L=8$ | 152 | 53.5 | 687 | $\gt 99.9\% $ |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=15$ | 74.4 | 22.3 | 297 | $\gt 99.9\% $ |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=20$ | 78.2 | 27.1 | 349 | $\gt 99.9\% $ |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=10$ | 86.5 | 22.6 | 313 | $\gt 99.9\% $ |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=15$ | 87.3 | 26.4 | 351 | $\gt 99.9\% $ |
| Nand | $N=20$, $K=4$ | 721 | 0.05 | 721.5 | $99.6\% $ |
| Nand | $N=30$, $K=2$ | 601 | 0.04 | 601.4 | $99.6\% $ |
| Nand | $N=30$, $K=3$ | 841 | 0.08 | 841.8 | $99.7\% $ |
| brp | $N=1024$, $\mathit{Max}=2$ | 6219 | 155 | 7769 | $\gt 99.9\% $ |
| brp | $N=2048$, $\mathit{Max}=2$ | 12429 | 454 | 16968 | $\gt 99.9\% $ |
To evaluate the impact of the statistical approach for terminating simulation runs, we apply the method proposed in Section
4.3 to the selected case study models. The results are presented in Table
2. For each model, we execute 10% of the required simulations for the training step. The average and standard deviation of the lengths for each model are reported in the table. Based on the relation proposed in Section
4.3 and to bound the errors to less than 0.01, we determine the computed upper bounds for the length of remaining simulation runs. In three classes of models (
egl,
crowds, and
brp), the computed upper bounds for terminating simulation runs are fully accurate and correctly terminate runs that have reached
${S^{0}}$. For the
Nand models, the correctness exceeds 99.6%, indicating near-perfect run termination.
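The upper bounds in Table 2 can be sketched as follows. The exact relation is given in Section 4.3 and is not restated here; we only note that the reported values are consistent with a Chebyshev-style bound $ub = \mu + \sigma /\sqrt{\epsilon }$, which is the assumption this illustrative snippet makes:

```python
import math

def run_length_bound(mean, std, eps=0.01):
    """Chebyshev-style upper bound on simulation-run length.

    Chebyshev's inequality gives P(L >= mean + k*std) <= 1/k**2, so choosing
    k = 1/sqrt(eps) bounds the probability of exceeding the cutoff by eps.
    With eps = 0.01 this yields ub = mean + 10*std.
    """
    return mean + std / math.sqrt(eps)
```

For example, the Nand instance with μ = 721 and σ = 0.05 yields ub = 721.5, and the brp instance with μ = 6219 and σ = 155 yields 7769, matching the $ub$ column of Table 2.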
5.3 Stability Analysis
Assessing the reliability of statistical model checking requires multiple runs to evaluate result stability, defined as the consistency of results across multiple experimental runs. This evaluation employs the t-test and Levene’s test (Milton and Arnold, 2002), where Levene’s test examines variance equality and the t-test verifies mean equality between independent result groups.
In the evaluation process, the proposed algorithms were independently executed 30 times and the probability of the reachability property was computed for a single instance, namely the egl model with $N=6$ and $L=8$. Algorithm 3 (V3) and this instance were selected as a representative example of all possible parameters and instances. The experimental runs were subsequently partitioned into two distinct groups, each comprising 15 independent runs. Table 3 presents the stability analysis of the experimental results.
Table 3
The stability analysis of the experimental results for the significance level of 0.01.
| Correctness | Levene’s test Sig. | t-test Sig. (2-tailed) | 99% CI of the difference, Lower | 99% CI of the difference, Upper |
| --- | --- | --- | --- | --- |
| Equal variances assumed | 0.598 | 0.520 | $-0.004095$ | 0.002533 |
| Equal variances not assumed | – | 0.520 | $-0.004096$ | 0.002534 |
Levene’s test significance value (Sig. $=0.598$) exceeds 0.01, supporting the equal-variance assumption. The t-test significance (0.520) and the confidence interval of the difference containing 0 indicate no statistically significant difference between the group means at the 99% confidence level, confirming result stability in the sense that the number of runs is sufficient.
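This two-step procedure can be reproduced with `scipy.stats`. The groups below are hypothetical correctness values for illustration, not the actual 15-run groups behind Table 3:

```python
from scipy import stats

def stability_check(group1, group2, alpha=0.01):
    """Levene's test for equal variances, then the two-sample t-test for
    equal means, mirroring the analysis reported in Table 3."""
    _, p_levene = stats.levene(group1, group2)
    equal_var = p_levene > alpha  # variances deemed equal if not rejected
    _, p_ttest = stats.ttest_ind(group1, group2, equal_var=equal_var)
    return {
        "levene_p": p_levene,
        "ttest_p": p_ttest,
        "stable": p_ttest > alpha,  # means not significantly different
    }

# Hypothetical five-run groups of correctness values.
g1 = [0.996, 0.998, 0.997, 0.999, 0.995]
g2 = [0.997, 0.996, 0.998, 0.995, 0.999]
result = stability_check(g1, g2)
```

As in Table 3, a Levene p-value above the significance level selects the equal-variance form of the t-test; a t-test p-value above the level then indicates stable results.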
5.4 Comparison with the State-of-the-Art Methods
Based on their superior performance and robustness within the DTMC models for the SMC task, we selected Algorithm
2 and Algorithm
3 (V3) as the basis for our comparison. In the literature, SimTermination and SimAnalysis developed by Younes
et al. (
2010), SimAdaptive by Daca
et al. (
2017), and the method developed by Brázdil
et al. (
2015) represent the state-of-the-art. Since the approach by Daca
et al. has been shown to dominate these existing methods (Daca
et al.,
2017), we evaluate it against our proposed algorithms. The results, obtained using a machine equipped with a 1.8 GHz Intel(R) Core(TM) i7-10510U processor and 12 GB of RAM running Ubuntu 24.04, are presented in Table
4. In the table, the running time in seconds, memory consumption for both the training step and the active simulation phase of SMC process, and the number of detected
${S^{0}}$ states are reported.
Table 4
Runtime, memory consumption, and the number of detected ${S^{0}}$ states for the selected DTMC models of our approaches in comparison with state-of-the-art.
| Model name | Parameter values | Algorithm 2 Time (s) | Algorithm 2 Memory | Algorithm 3 (V3) Time (s) | Algorithm 3 (V3) Memory | SimAdaptive $|\approx {S^{0}}|$ | SimAdaptive Time (s) | SimAdaptive Memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| egl | $N=5$, $L=6$ | 7.6 | 8.4 MB | 8.3 | 539 KB | 523 | 27.6 | 32 KB |
| egl | $N=5$, $L=8$ | 13 | 9.7 MB | 28.9 | 752 KB | 522 | 38.4 | 43 KB |
| egl | $N=6$, $L=6$ | 9.5 | 12 MB | 19.6 | 728 KB | 1428 | 36.8 | 39.6 KB |
| egl | $N=6$, $L=8$ | 15.5 | 17 MB | 29.4 | 921 KB | 1391 | 49 | 50.7 KB |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=15$ | 1.5 | 653 KB | 1.6 | 67 KB | 603 | 2 | 9.3 KB |
| crowds | $\mathit{TR}=5$, $\mathit{CS}=20$ | 1.5 | 643 KB | 1.7 | 57 KB | 835 | 2 | 9.4 KB |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=10$ | 2.5 | 955 KB | 2.9 | 114 KB | 528 | 3.3 | 11 KB |
| crowds | $\mathit{TR}=6$, $\mathit{CS}=15$ | 2.6 | 6.5 MB | 3 | 788 KB | 845 | 3.3 | 11.2 KB |
| Nand | $N=20$, $K=4$ | 10.9 | 489 KB | 10.5 | 293 KB | 18 | 11.1 | 23 KB |
| Nand | $N=30$, $K=2$ | 10 | 135 KB | 6.5 | 157 KB | 18 | 10.3 | 19 KB |
| Nand | $N=30$, $K=3$ | 13.7 | 335 KB | 3.5 | 244 KB | 19 | 14 | 26 KB |
| brp | $N=1024$, $\mathit{Max}=2$ | 42.8 | 472 KB | 32 | 424 KB | 136 | 43.4 | 446 KB |
| brp | $N=2048$, $\mathit{Max}=2$ | 184 | 867 KB | 37 | 23 KB | 257 | 188 | 905 KB |
As demonstrated in Table
4, in most cases, our proposed approaches run significantly faster than the state-of-the-art SimAdaptive method while maintaining the computational correctness of the results. For example, in the case of
egl for
$N=5$ and
$L=6$, Algorithm
2 and Algorithm
3(V3) consume 7.6 and 8.3 seconds, respectively, while SimAdaptive takes three to four times longer. For the
Nand and
brp cases, Algorithm
3(V3) works faster than Algorithm
2. Compared to SimAdaptive, our proposed framework labels a significantly larger portion of states as belonging to
${S^{0}}$ without dramatically degrading the precision of reachability probabilities. This ability to identify a wider range of
${S^{0}}$ states leads to the early termination of simulation runs that would otherwise enter BSCCs. In contrast, SimAdaptive detects only a small part of
${S^{0}}$ and continues simulations on many states that actually have zero probability of reaching a goal state. For example, in the case of
egl for
$N=5$ and
$L=8$, Algorithm
2 and Algorithm
3(V3) detect 114 869 and 112 263 states of 115 136 states in
${S^{0}}$, while SimAdaptive detects only 522 states. Regarding resource efficiency, the memory consumption of the compared approaches remains highly manageable, with usage in most cases staying below 1 MB.