3.1 Presentation of Two Versions of the Proposed Algorithms

Fig. 1
General principle of the new algorithm. In this figure $\text{C}(i)$ is the set of all candidate itemsets of size i and $\text{CF}(i)$ the set of all frequent itemsets of size i.
Like most pattern discovery algorithms following the Apriori principle, the Graank algorithm performs a breadth-first traversal of the search space to identify frequent gradual itemsets. In this algorithm, candidate itemsets of size $(k+1)$ are generated by exploiting only gradual and frequent itemsets of size k. To do this, two frequent itemsets of size k are concatenated to form a candidate of size $(k+1)$ if they have $(k-1)$ items in common. The size of a candidate determines its level. For example, if ${A^{+}}{S^{+}}$ and ${A^{+}}{C^{+}}$ are frequent, we concatenate them to obtain the candidate ${A^{+}}{S^{+}}{C^{+}}$ of size 3, i.e. of level 3. At the k-th iteration, the algorithm generates all candidates of size k from the frequent itemsets of size $(k-1)$, then searches for the candidates that are frequent. If $k=1$, the set of candidates is equal to the set of items. The algorithm stops as soon as no frequent itemset is found in an iteration. When $k\geqslant 2$, the k-th iteration requires a memory area to store all frequent itemsets of level $(k-1)$ with their binary matrices and all frequent candidates of level k with their binary matrices. The cost of this storage can be significant. The binary matrix of a candidate of size k is obtained as the product of the matrices of two frequent itemsets of size $(k-1)$; thus, it results from a total of $(k-1)$ matrix products. Because of this, the total cost of matrix products can significantly increase the running time of Graank.
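To make the Apriori-style join concrete, here is a minimal sketch (ours, not the paper's pseudocode): two frequent itemsets of size k sharing $(k-1)$ items are merged into one candidate of size $(k+1)$. The item names are illustrative, and we assume that the "matrix product" of the binary matrices is their element-wise conjunction.

```python
from itertools import combinations

def matrix_and(a, b):
    """Element-wise product of two 0/1 matrices given as nested lists.
    (Assumption: the 'matrix product' used for gradual itemsets is the
    element-wise conjunction of the parents' binary matrices.)"""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def join_level(frequents):
    """Apriori-style join used by Graank: two frequent itemsets of size
    k sharing (k - 1) items yield one candidate of size (k + 1).
    `frequents` maps a frozenset of gradual items, e.g. {'A+', 'S+'},
    to its binary matrix; the names are illustrative only."""
    candidates = {}
    for x, y in combinations(frequents, 2):
        if len(x & y) == len(x) - 1:           # (k - 1) items in common
            candidates[x | y] = matrix_and(frequents[x], frequents[y])
    return candidates

# Toy run: two frequent 2-itemsets sharing 'A+' give one level-3 candidate.
m = [[1, 1], [0, 1]]
freq2 = {frozenset({'A+', 'S+'}): m, frozenset({'A+', 'C+'}): m}
cands = join_level(freq2)
print(sorted(next(iter(cands))))   # → ['A+', 'C+', 'S+']
```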
The new algorithm presented here attempts to correct Graank's weaknesses. Compared to Graank, at iteration $k\geqslant 3$, it reduces the memory space needed to store the itemsets used to generate the candidates and the frequent itemsets discovered at iteration k that will be used at iteration $(k+1)$ to generate the candidates. Moreover, it significantly reduces the number of matrix products needed to determine the binary matrix of a candidate. The first difference between Graank and the proposed algorithm concerns the way candidates are generated. All candidates from size $(k+1)$ up to size $2k$ are generated only from gradual and frequent itemsets of size k. In this candidate generation approach, two frequent itemsets of size k having p items in common make it possible to generate a candidate of size $(2k-p)$, which lies between $(k+1)$ and $2k$. The second difference comes from the processing carried out at each iteration, as illustrated in Fig. 1. Note that we start by searching for all the frequent items before entering the iteration phase. As in Graank, the iterations are numbered starting from 1. However, during iteration $k\geqslant 1$, instead of searching for all the frequent itemsets of level k as in Graank, we rather search for all the frequent itemsets of levels ${2^{k-1}}+1$ up to ${2^{k}}$. In the first version of the algorithm, at the k-th iteration, the algorithm generates all candidates whose size is between $({2^{k-1}}+1)$ and ${2^{k}}$ only from the frequent itemsets of size ${2^{k-1}}$ obtained at the $(k-1)$-th iteration. Then, it goes through all the candidates in ascending order of their level, from level $({2^{k-1}}+1)$ to ${2^{k}}$, to select those that are frequent. Like Graank, the proposed algorithm performs a breadth-first traversal of the search space and stops as soon as it encounters the first level which does not contain a frequent candidate. In the second version of the algorithm, at the k-th iteration, the algorithm does not simultaneously store all the candidates whose level varies from $({2^{k-1}}+1)$ to ${2^{k}}$. It performs an internal iteration which at each step generates a candidate, calculates its binary matrix and determines if it is frequent. The second version of the algorithm stops at the beginning of iteration k if there is no frequent itemset of size ${2^{k-1}}$.
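The doubling schedule described above can be made concrete with a tiny sketch (ours, not the paper's pseudocode) mapping each level to the iteration of the proposed algorithms that processes it.

```python
def iteration_for_level(l):
    """Iteration k of the proposed algorithms that discovers frequent
    itemsets of level l >= 2, i.e. the k with 2**(k-1) < l <= 2**k
    (level 1 is handled before the iteration phase)."""
    k = 1
    while 2 ** k < l:
        k += 1
    return k

# Iteration 1 covers level 2, iteration 2 covers levels 3-4,
# iteration 3 covers levels 5-8, iteration 4 covers levels 9-16, ...
print([iteration_for_level(l) for l in range(2, 17)])
# → [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4]
```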

Algorithm 1
First version of the improved algorithm

Algorithm 2
Second version of the improved algorithm
Algorithms
1 and
2 describe two versions of the improved algorithm. They take as input a database and a minimal support provided by the user, and return the set of frequent gradual patterns. Table
4 describes the list of symbols used in the algorithms. As in Graank, both versions of the proposed algorithm exploit the notion of complementarity to reduce the search space by half. To do this, only itemsets with a positive first term are handled. Both versions search for and store all frequent itemsets in main memory. Both versions can be modified as follows to display each frequent itemset immediately after its discovery and to reduce main-memory consumption:
– Display each frequent itemset after its discovery.
– Store each new frequent itemset whose size is a power of 2, i.e. of the form ${2^{k}}$, in ${F_{{2^{k}}}^{+}}$.
– Free the memory space occupied by ${F_{{2^{k-1}}}^{+}}$ as soon as the construction of ${F_{{2^{k}}}^{+}}$ is finished.
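The three modifications above can be sketched as follows. This is a minimal illustration under our own assumptions: `expand(seeds, k)` is a hypothetical helper that, from the frequent itemsets of size $2^{k-1}$, yields the frequent itemsets of levels $2^{k-1}+1$ up to $2^{k}$ in ascending order of level.

```python
def mine_low_memory(expand, frequent_items):
    """Sketch of the memory-saving variant: emit each frequent itemset
    as soon as it is discovered, keep only the sets F+_{2^k} whose size
    is a power of two, and free F+_{2^{k-1}} once F+_{2^k} is built."""
    found = []
    previous = list(frequent_items)    # F+_{2^0}: the frequent items
    k = 1
    while previous:
        current_top = []               # will become F+_{2^k}
        for itemset in expand(previous, k):
            found.append(itemset)      # stands in for displaying it
            if len(itemset) == 2 ** k:
                current_top.append(itemset)
        previous = current_top         # frees F+_{2^{k-1}}
        k += 1
    return found

# Toy expander: one frequent itemset at levels 2 and 3, then nothing.
def toy_expand(seeds, k):
    return {1: [('A+', 'S+')], 2: [('A+', 'S+', 'C+')]}.get(k, [])

print(mine_low_memory(toy_expand, [('A+',), ('S+',), ('C+',)]))
# → [('A+', 'S+'), ('A+', 'S+', 'C+')]
```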
There are two main differences between Algorithms
1 and
2. The first difference comes from the management of the memory space used to store the candidates of an iteration. During an iteration, Algorithm
1 generates the set of all candidates and selects the frequent ones while Algorithm
2 generates the first candidate and keeps it if it is frequent, then generates the second candidate and retains it if it is frequent, and so on. Thus Algorithm
1 requires an additional memory space to store all the candidates of each iteration while Algorithm
2 does not need such memory space. The key idea of Algorithm
2 is to save memory. The second difference comes from their termination criteria. Algorithm
1 could terminate during one iteration without computing the supports of all candidates in said iteration while Algorithm
2 cannot do this because it calculates the supports of all the candidates of an iteration. As a result, Algorithm
1 generally finishes before Algorithm
2.
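The materialise-then-filter versus stream-and-test difference can be sketched as follows; `is_frequent` stands in for the gradual support test and all names are illustrative.

```python
def frequents_v1(candidates, is_frequent):
    """Algorithm 1 style: materialise every candidate of the iteration,
    then filter -- needs memory for the whole candidate set."""
    all_candidates = list(candidates)        # stored simultaneously
    return [c for c in all_candidates if is_frequent(c)]

def frequents_v2(candidates, is_frequent):
    """Algorithm 2 style: handle candidates one at a time, keeping only
    the frequent ones -- no extra space for the candidate set."""
    kept = []
    for c in candidates:                     # generated one by one
        if is_frequent(c):
            kept.append(c)
    return kept

# Both return the same frequent itemsets; only peak memory differs.
cands = [('A+', 'S+'), ('A+', 'C+'), ('S+', 'C+')]
support = {('A+', 'S+'): 0.6, ('A+', 'C+'): 0.2, ('S+', 'C+'): 0.5}
freq = lambda c: support[c] >= 0.5
print(frequents_v1(cands, freq) == frequents_v2(cands, freq))  # → True
```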
Table 4
List of symbols used in the algorithms.
Notation | Signification
$\mathcal{D}$ | Database
${F_{k}^{+}}$ | Set of frequent itemsets of size k whose first term is positive
${C_{k}^{+}}$ | Set of candidate itemsets of size k whose first term is positive
m | Number of database attributes
Threshold | Minimal support given by the user
3.2 Complexity Analysis
This section studies the time and memory complexities of Graank and the proposed algorithms.
Consider a candidate
c of level
$l={2^{k-1}}+p$ for some
p and some
k, corresponding to the
k-th iteration of the new algorithm, such that
$1\leqslant p\leqslant {2^{k-1}}$. In Graank,
c is derived from the concatenation of two frequent itemsets of size
$(l-1)$, which in turn are each derived from the concatenation of two frequent itemsets of size
$(l-2)$. By transitivity,
c comes from the concatenation of
${2^{p}}$ frequent itemsets of size
${2^{k-1}}$. The matrix of
c comes from a product of two matrices of level
$(l-1)$, and each of them comes in turn from two matrices of level
$(l-2)$ and so on. On the other hand, in both versions of the proposed algorithm,
c is derived from the concatenation of two frequent itemsets of size
${2^{k-1}}$. Moreover, the matrix of
c is obtained from the product of two matrices of frequent itemsets of level
${2^{k-1}}$. The number of matrix products performed between the first and the
k-th iteration for the computation of the matrix of
c is
k. Therefore, generating candidate
c requires
$(l-1)$ itemset concatenations and
$(l-1)$ matrix products in Graank, while
k itemset concatenations and
k matrix products are needed in the proposed algorithms. This leads to Lemma
1.
Lemma 1.
The generation of a candidate of level $l={2^{k-1}}+p$, for some p and k with $1\leqslant p\leqslant {2^{k-1}}$, together with its matrix requires $(l-1)$ itemset concatenations and $(l-1)$ matrix products in Graank, while k itemset concatenations and k matrix products are needed in the proposed algorithms.
Lemma
1 shows that the CPU cost of generating a candidate and computing its binary matrix is significantly improved.
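The gap stated in Lemma 1 can be illustrated numerically: the cumulative per-candidate cost is linear in the level for Graank but only logarithmic for the proposed algorithms. A minimal sketch (the function names are ours):

```python
import math

def graank_cost(l):
    """Itemset concatenations (= matrix products) needed to build a
    level-l candidate and its matrix in Graank, per Lemma 1."""
    return l - 1

def proposed_cost(l):
    """Same cost in the proposed algorithms: the iteration k reaching
    level l, i.e. the k with 2**(k-1) < l <= 2**k, so k = ceil(log2 l)."""
    return math.ceil(math.log2(l))

# The gap grows with the level: linear versus logarithmic.
for l in (4, 8, 16, 64):
    print(l, graank_cost(l), proposed_cost(l))
```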
Lemma 2.
Denote by $TC$ and $T{C_{1}}$ the time complexities of Graank and of Algorithms 1 and 2, respectively. We have
and
where ${C_{l}^{g}}$ (resp. ${C_{l}}$) is the set of candidates of size l generated by Graank (resp. by the proposed algorithms).
Proof.
In (4) and (5), the first expression is the time complexity of the generation of candidates. The second expression is the time complexity of the computation of the binary matrices of candidates. □
Lemma
2 shows that, compared to Graank, the overall candidate generation runtime in the proposed algorithms is significantly improved. The overall runtime of the calculation of matrix multiplications is not improved although the execution time for the calculation of the matrix of a single itemset is significantly improved according to Lemma
1.
Denote by
$MS(k)$,
$M{S_{1}}(k)$ and
$M{S_{2}}(k)$ the memory space required by the
k-th iteration respectively in Graank, in the first and second versions of the proposed algorithm.
$MS(k)$,
$M{S_{1}}(k)$ and
$M{S_{2}}(k)$ depend on the storage space of candidate itemsets and frequent itemsets. Here, we evaluate
$MS(k)$,
$M{S_{1}}(k)$ and
$M{S_{2}}(k)$. In Graank, at iteration
k, after generating a candidate of size
k, its gradual support is calculated to know if it is frequent or not before generating the next candidate. Any frequent candidate discovered at iteration
k is stored in
${F_{k}^{+}}$. Thus, in Graank, we have:
In (6), the first expression is the memory space required to store all frequent itemsets discovered between levels 1 and
$(k-2)$, without storing their matrices. The second expression is the memory space required to store all the frequent itemsets discovered at level
$(k-1)$, with their matrices. All frequent itemsets of size
$(k-1)$ and their matrices are used to generate all the candidates of size
k and their matrices respectively. The third expression is the memory space required to store the current candidate and its matrix, i.e. the one being processed. The fourth expression is the memory space required to store all frequent itemsets of size
k and their matrices.
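Formula (6) is not reproduced in this excerpt. Taking the four expressions just described at face value, and assuming (consistently with Lemma 3) that each binary matrix occupies ${n^{2}}$ cells and an itemset of size l occupies l cells, a plausible reconstruction is:

```latex
MS(k) \;=\; \sum_{l=1}^{k-2} l\,\big|F_{l}^{+}\big|
\;+\; \big(k-1+n^{2}\big)\big|F_{k-1}^{+}\big|
\;+\; \big(k+n^{2}\big)
\;+\; \big(k+n^{2}\big)\big|F_{k}^{+}\big|
```

The four summands correspond, in order, to the four expressions discussed above; the exact per-itemset storage constants are our assumption.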
In the first version of the proposed algorithm, iteration
k starts by generating and storing all candidates whose size is between
${2^{k-1}}+1$ and
${2^{k}}$. Then, it performs a breadth-first search of all candidates, i.e. level by level. The gradual support of the current candidate, i.e. the one being processed, is calculated to know if it is frequent or not. A frequent candidate of size
l is stored in
${F_{l}^{+}}$ without its matrix if
$l\lt {2^{k}}$ and with its matrix otherwise. Once all the candidates of the same level have been processed, their memory space is freed and used to store frequent itemsets. Thus, in the first version of the proposed algorithm, we have:
This is equivalent to:
In (8), the first expression is the memory space required to store all frequent itemsets discovered between levels 1 and
${2^{k-1}}-1$, without storing their matrices. The second expression is the memory space required to store all the frequent itemsets discovered at level
${2^{k-1}}$ with their matrices. All frequent itemsets of size
${2^{k-1}}$ and their matrices are used to generate all candidates whose size is between
${2^{k-1}}+1$ and
${2^{k}}$ with their matrices respectively. The third expression is the memory space required to store the current candidate and its matrix, i.e. the one being processed. The fourth expression is the memory space needed to store all candidates of unprocessed levels and all frequent itemsets whose size is between
${2^{k-1}}+1$ and
${2^{k}}$ without their matrices. The fifth expression is the memory space required to store all the matrices of frequent itemsets of size
${2^{k}}$.
In the second version of the proposed algorithm, at iteration
k, after generating a candidate whose size is between
${2^{k-1}}+1$ and
${2^{k}}$, its support is calculated to know if it is frequent or not before generating the next candidate. If the current candidate, i.e. the one being processed, is frequent and of size l, it is stored in
${F_{l}^{+}}$ without its matrix if
$l\lt {2^{k}}$ and with its matrix otherwise. Thus, in the second version of the proposed algorithm, we have:
This is equivalent to:
In formula (10), the first, second and third expressions have the same meaning as their counterparts in formula (8). The fourth expression is the memory space needed to store all frequent itemsets whose size is between
${2^{k-1}}+1$ and
${2^{k}}-1$ without their matrices. The fifth expression is the memory space required to store all frequent itemsets of size
${2^{k}}$ and their matrices.
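Formula (10) is likewise not reproduced here. Under the same storage assumptions as before (an itemset of size l occupies l cells, each binary matrix ${n^{2}}$ cells), the five expressions just described suggest the reconstruction:

```latex
MS_{2}(k) \;=\; \sum_{l=1}^{2^{k-1}-1} l\,\big|F_{l}^{+}\big|
\;+\; \big(2^{k-1}+n^{2}\big)\big|F_{2^{k-1}}^{+}\big|
\;+\; \big(2^{k}+n^{2}\big)
\;+\; \sum_{l=2^{k-1}+1}^{2^{k}-1} l\,\big|F_{l}^{+}\big|
\;+\; \big(2^{k}+n^{2}\big)\big|F_{2^{k}}^{+}\big|
```

As a sanity check, the matrix terms here total ${n^{2}}(|{F_{{2^{k-1}}}^{+}}|+1+|{F_{{2^{k}}}^{+}}|)$, which is consistent with the identity $MS({2^{k}})=M{S_{2}}(k)+(|{F_{{2^{k}}-1}^{+}}|-|{F_{{2^{k-1}}}^{+}}|){n^{2}}$ stated in Lemma 3.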
Lemma 3.
We have $M{S_{2}}(k)\lt M{S_{1}}(k)$ and $MS({2^{k}})=M{S_{2}}(k)+(|{F_{{2^{k}}-1}^{+}}|-|{F_{{2^{k-1}}}^{+}}|){n^{2}}$.
Proof.
The first, second, third and fifth expressions of formula (9) are respectively equal to their counterparts in formula (11). The fourth expression of formula (11) is less than its counterpart in formula (9). Thus, we have $M{S_{2}}(k)\lt M{S_{1}}(k)$. The calculation of $MS({2^{k}})-M{S_{2}}(k)$ using formulas (12) and (7) leads to $MS({2^{k}})=M{S_{2}}(k)+(|{F_{{2^{k}}-1}^{+}}|-|{F_{{2^{k-1}}}^{+}}|){n^{2}}$. □
Lemma 4.
Denote by ${k_{2}}$, ${k_{g1}}$ and ${k_{g2}}$, respectively, the iterations for which $M{S_{2}}(k)$, $MS(k)$ and $MS({2^{k}})$ reach their maximum value. Denote by $\textit{SMC}$, ${\textit{SMC}_{1}}$ and ${\textit{SMC}_{2}}$ the memory space complexities required by Graank, Algorithm 1 and Algorithm 2, respectively. We have:
Proof.
(14) and (15) are straightforward from the respective definitions of $MS(k)$, $M{S_{1}}(k)$ and $M{S_{2}}(k)$. (16) is straightforward from the inequality $M{S_{2}}(k)\lt M{S_{1}}(k)$ established in Lemma 3. (17) and (18) are straightforward from the equality $MS({2^{k}})=M{S_{2}}(k)+(|{F_{{2^{k}}-1}^{+}}|-|{F_{{2^{k-1}}}^{+}}|){n^{2}}$ established in Lemma 3. □
Lemma
4 shows that the first version of the proposed algorithm consumes more memory space than the second one, which in turn can consume less or more memory space than Graank under certain constraints. However, the first version may stop one iteration before the second; thus, the first version stops faster than the second. The second version stops at the start of an iteration that has no candidate. Such an iteration k implies that there is no frequent itemset of size ${2^{k-1}}$. The first version stops during an iteration in which some level has no frequent itemset. Such an iteration k implies that there is no frequent itemset of size ${2^{k-1}}+p$ for some p, $1\leqslant p\leqslant {2^{k-1}}$.