2.1 Approach Description
The study considers the classification problem of the following kind (Kuzmich and Masich,
2014). There is a data set consisting of two disjoint sets
${\Omega ^{+}}$ and
${\Omega ^{-}}$ of
n-dimensional vectors belonging to the positive and the negative class, respectively. The components of the vectors, also called attributes, can be numeric (nominal) as well as binary (Stupina et al., 2012). The task is to assign a new observation, itself a vector of n variables, to the appropriate class.
The suggested data classification approach is based on the method originating from the theory of combinatorial optimization, which is called
Logical Analysis of Data (LAD) (Hammer and Bonates,
2005). This method has been usefully employed in solving a range of problems in various fields (Kuzmich and Masich,
2012; Hammer
et al.,
2004a,
2004b; Herrera and Subasi,
2013). The key idea of the method is to apply a combination of “differentiation” and “integration” actions to a section of the space of original attributes containing the given positive and negative observations. The “differentiation” stage involves defining a family of small subsets sharing characteristic positive or negative features. At the “integration” stage, unions of these subsets, formed in a specific manner, are treated as approximations of the areas of the attribute space consisting of positive and negative observations, respectively (Kuzmich and Masich, 2014).
The method proceeds through the following steps (Hammer and Bonates, 2005):
a) To remove redundant variables from the original data set, a subset S of variables is singled out that suffices to distinguish positive observations from negative ones. The further steps of the method use the projections ${\Omega _{s}^{+}}$ and ${\Omega _{s}^{-}}$ of the sets ${\Omega ^{+}}$ and ${\Omega ^{-}}$ on S.
b) The ${\Omega _{s}^{+}}$ set is covered with a family of similar subsets of a smaller space, each of which significantly overlaps with ${\Omega _{s}^{+}}$ but does not overlap with ${\Omega _{s}^{-}}$; alternatively, a minor overlap with ${\Omega _{s}^{-}}$ is acceptable if it yields a greater overlap with ${\Omega _{s}^{+}}$. Such subsets are called “positive patterns.” In a similar fashion, the ${\Omega _{s}^{-}}$ set is covered with “negative patterns.”
c) Then it is necessary to identify the subset of positive patterns whose union covers all ${\Omega _{s}^{+}}$ observations and the subset of negative patterns whose union covers all ${\Omega _{s}^{-}}$ observations.
d) A classifier is then built on these two subsets of patterns: a new observation is classified according to whether it is covered by the union of the positive patterns or by the union of the negative patterns.
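For illustration, the following Python sketch shows how the classification step (d) might look when patterns are represented as sets of fixed coordinates; the dictionary representation, the function names and the tie-breaking rule are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of step (d): classifying an observation by pattern voting.
# A pattern is represented here as a dict {attribute_index: required_binary_value};
# this representation and all names are illustrative assumptions, not the authors' code.

def covers(pattern, x):
    """Return True if binary observation x (a sequence of 0/1) lies in the pattern's subcube."""
    return all(x[j] == v for j, v in pattern.items())

def classify(x, positive_patterns, negative_patterns):
    """Vote: +1 if only positive patterns cover x, -1 if only negative ones, 0 if undecided."""
    pos = sum(covers(p, x) for p in positive_patterns)
    neg = sum(covers(p, x) for p in negative_patterns)
    if pos > 0 and neg == 0:
        return +1
    if neg > 0 and pos == 0:
        return -1
    # Covered by both or by neither: some tie-breaking rule (e.g. weighted voting) is needed.
    return 1 if pos > neg else (-1 if neg > pos else 0)

# Example usage with toy binary data:
pos_pats = [{0: 1, 2: 0}]          # "x0 = 1 and x2 = 0"
neg_pats = [{0: 0}]                # "x0 = 0"
print(classify([1, 1, 0], pos_pats, neg_pats))   # -> 1
```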
2.2 Binarization of Attributes
The method under study is intended for data sets with binary attributes. Since the original data set can include attributes of various types, it is necessary to binarize them.
One of the simplest binarization methods links each metric variable to a number of binary variables. A binary variable is assigned the value 1 if the value of the corresponding metric variable exceeds a certain threshold, and 0 otherwise. This method is referred to in Rastrigin and Freymanis (1988) as “unitary”. Its flaw is that it produces numerous combinations of binary variables (${2^{n}}-n-1$ of them, where n is the number of binary variables) that cannot be linked to any point in the original space. This makes the method difficult to use for coding the variable arguments of criterion functions in optimization problems, as it generates a great number of invalid solutions. However, this does not matter for classification, because the binary variables are obtained by coding predefined metric variables. The main advantage of the method is that the distances in the original and the binary space are consistent with each other: points that are close in the original space remain close in the binarized space. This, in turn, makes it possible, as early as at the binarization stage, to minimize the number of thresholds by mapping close values of the original variable to the same values in the binary space (provided that the positive and negative subsets of observations remain disjoint) (Hammer and Bonates, 2005).
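As an illustration of the unitary scheme, the sketch below converts one metric attribute into threshold indicators; the function name and the NumPy-based layout are assumptions made for brevity.

```python
import numpy as np

def unitary_binarize(values, thresholds):
    """'Unitary' binarization sketch: one binary variable per threshold,
    set to 1 when the metric value exceeds that threshold and to 0 otherwise.
    Thresholds are assumed to be given in ascending order."""
    values = np.asarray(values, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Result has shape (n_observations, n_thresholds).
    return (values[:, None] > thresholds[None, :]).astype(int)

# Toy example: one metric attribute, three thresholds.
x = [0.2, 1.5, 3.7]
print(unitary_binarize(x, thresholds=[1.0, 2.0, 3.0]))
# [[0 0 0]
#  [1 0 0]
#  [1 1 1]]
```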
There also exists another binarization method, described in Vorontsov (2010). An arbitrary attribute $f:X\to {D_{f}}$ generates terms that check whether the value of $f(x)$ falls into certain subsets of the set ${D_{f}}$. Some typical constructions of this kind are provided in Vorontsov (2010):
– if f is a nominal attribute, terms of the form $\beta (x)=[f(x)=d]$, $d\in {D_{f}}$;
– if f is an ordinal or quantitative attribute, threshold terms of the form $\beta (x)=[f(x)\leqslant d]$, $d\in {D_{f}}$.
For a quantitative attribute $f:X\to R$, it is sufficient to consider only those threshold values d that partition the set ${X^{\ell }}$ in different ways. After excluding the trivial partitions that turn $\beta (x)$ into the constant 0 or 1 on the whole set, the number of remaining values does not exceed $\ell -1$. For instance, the thresholds can be taken as the midpoints between adjacent sorted values:
${d_{i}}=\frac{{f^{(i)}}+{f^{(i+1)}}}{2},\quad i=1,\dots ,\ell -1,$   (1)
where ${f^{(1)}}\leqslant \cdots \leqslant {f^{(\ell )}}$ is the sequence of values $f({x_{1}}),\dots ,f({x_{\ell }})$ of the attribute f over the observations of the set, sorted in ascending order.
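The following sketch illustrates one way to enumerate such candidate thresholds, assuming the midpoint rule of formula (1) and keeping only the cuts at which the class label changes (the cuts that matter for the zone construction described next); all names are illustrative.

```python
def candidate_thresholds(values, labels):
    """Sketch of threshold selection for a quantitative attribute, assuming the
    midpoint rule of formula (1): a threshold is placed between adjacent sorted
    values only where the class label changes (other cuts give no new useful partition)."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if v1 != v2 and y1 != y2:          # skip trivial and duplicate cuts
            thresholds.append((v1 + v2) / 2.0)
    return thresholds

print(candidate_thresholds([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))   # -> [2.5]
```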
Should the resulting terms later be used for the synthesis of conjunctions, it is recommended to pick the most informative ones right away, to cut down on the iterations of sequential search. For ordinal and quantitative attributes, this problem is solved through the optimal partitioning of the range of attribute values into zones. The process of such partitioning is described below.
Suppose $f:X\to R$ is a quantitative attribute and ${d_{1}},\dots ,{d_{r}}$ is an ascending sequence of thresholds. The zones containing the values of the attribute f are defined as terms of the following kind:
${\varepsilon _{0}}(x)=[f(x)<{d_{1}}],\qquad {\varepsilon _{i}}(x)=[{d_{i}}\leqslant f(x)<{d_{i+1}}],\ i=1,\dots ,r-1,\qquad {\varepsilon _{r}}(x)=[f(x)\geqslant {d_{r}}].$
For example, a greedy zone-merging algorithm starts by dividing the range into “small zones.” The thresholds are calculated according to formula (1) and are placed between all pairs of adjacent points ${x_{i-1}}$, ${x_{i}}$ of which exactly one belongs to class k. The initial division thus consists of alternating zones of the kind “only k – only not k”. The zones are then consolidated by merging triples of adjacent zones. It is important to merge triples rather than pairs, since merging pairs would disrupt the “k – not k” alternation and leave some “small zones” unmerged in the end. The merging algorithm stops when either of the following criteria is satisfied: a specified number r of zones has been reached; or, for every candidate merger, the original zones ${\varepsilon _{i-1}}$, ${\varepsilon _{i}}$ and ${\varepsilon _{i+1}}$ contain more information than the corresponding merged zone ${\varepsilon _{i-1}}\vee {\varepsilon _{i}}\vee {\varepsilon _{i+1}}$. The triple to merge is selected so as to achieve the maximum gain in information content after the merger.
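A rough sketch of this greedy merging is given below; representing each zone by a pair of class counts and scoring it with a boosting-style informativeness measure are assumptions, since the passage does not fix the criterion.

```python
import math

def zone_informativeness(p, n):
    """Assumed zone informativeness (a boosting-style sqrt(p) - sqrt(n) score);
    the exact criterion used by the authors is not specified in this passage."""
    return math.sqrt(p) - math.sqrt(n)

def greedy_zone_merging(zones, r):
    """Greedy merging sketch. 'zones' is the initial list of "small zones" along the attribute
    axis, each given as a pair (p, n) of class-k and non-class-k observation counts.
    Triples of adjacent zones are merged while the merge increases total informativeness,
    or until only r zones remain."""
    zones = list(zones)
    while len(zones) >= 3 and len(zones) > r:
        best_gain, best_i = 0.0, None
        for i in range(1, len(zones) - 1):
            triple = zones[i - 1:i + 2]
            p, n = sum(z[0] for z in triple), sum(z[1] for z in triple)
            gain = zone_informativeness(p, n) - sum(zone_informativeness(*z) for z in triple)
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:          # every possible merge would lose information: stop
            break
        i = best_i
        merged = (sum(z[0] for z in zones[i - 1:i + 2]), sum(z[1] for z in zones[i - 1:i + 2]))
        zones[i - 1:i + 2] = [merged]
    return zones

print(greedy_zone_merging([(3, 0), (0, 1), (4, 0), (0, 5)], r=2))   # -> [(3, 0), (4, 6)]
```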
2.3 Building a Support Set
An excessively large number of attributes in a data set can entail an enormous computational load. This is the case, for example, in genomics and proteomics, two of the most rapidly progressing areas of bioinformatics, where the expression or intensity levels of thousands, if not tens of thousands, of genes or proteins are included in the data set, even though a very small subset of these attributes is often sufficient to separate positive and negative observations extremely well (Alexe et al., 2002). One of the factors that complicates the extraction of an informative subset of attributes is the pronounced difference between the information content of individual attributes and the information content of a set of attributes.
It is therefore necessary to devise approaches for identifying a subset of attributes that separates positive and negative observations with a high degree of accuracy. One such approach, based on selecting a subset of attributes by building an optimization model in the form of a combinatorial optimization problem, is described below.
A set S of attributes is called a support set if the projection ${\Omega _{s}^{+}}$ of the set ${\Omega ^{+}}$ on S does not intersect with the projection ${\Omega _{s}^{-}}$ of the set ${\Omega ^{-}}$ on S. The entire set of attributes is a support set, since ${\Omega ^{+}}$ and ${\Omega ^{-}}$ do not intersect to begin with. A support set is called minimal if the elimination of any variable from it produces a data set in which some positive and negative observations coincide.
In order to find a minimal support set, one assigns to each attribute ${x_{i}}$, $i=1,\dots ,t$, of the binarized set a new binary variable ${y_{i}}$, which equals 1 if ${x_{i}}$ belongs to the support set and 0 otherwise. Denote a binary vector corresponding to a positive observation by $U=({u_{1}},{u_{2}},\dots ,{u_{t}})$ and one corresponding to a negative observation by $V=({v_{1}},{v_{2}},\dots ,{v_{t}})$. A new variable is then introduced:
${w_{i}}(U,V)=\begin{cases}1,&{u_{i}}\ne {v_{i}},\\ 0,&{u_{i}}={v_{i}},\end{cases}\quad i=1,\dots ,t.$
The separability of the sets ${\Omega _{s}^{+}}$ and ${\Omega _{s}^{-}}$ is then equivalent to the inequality $\textstyle\sum {w_{i}}(U,V){y_{i}}\geqslant 1$ holding for any $U\in {\Omega _{S}^{+}}$ and $V\in {\Omega _{S}^{-}}$.
To make the result more robust to errors in the measurements that produce the data, this condition should be made stricter by replacing 1 on the right-hand side of the inequality with a certain integer d. This means that the positive and negative observations must differ in at least d of the selected attributes.
Therefore, the problem of minimizing a support set can be formulated as a conditional pseudo-Boolean optimization problem:
${\textstyle\sum _{i=1}^{t}}{y_{i}}\to \min ,$
${\textstyle\sum _{i=1}^{t}}{w_{i}}(U,V){y_{i}}\geqslant d\quad \text{for all}\ U\in {\Omega _{S}^{+}},\ V\in {\Omega _{S}^{-}},$
where $y\in {\{0,1\}^{t}}$.
The objective function of this problem is a unimodal, monotonic pseudo-Boolean function (Antamoshkin and Masich, 2007a, 2007b; Antamoshkin and Semenkin, 1998), i.e. it has a single absolute minimum located at the point ${y_{0}}=(0,0,\dots ,0)$, and its value increases with the distance from the minimum point (when any component changes from 0 to 1). The constraint function is also a unimodal, monotonic pseudo-Boolean function; moreover, it is defined algorithmically, since its calculation requires iterating over all possible pairs of positive and negative observations.
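The paper solves this model with boundary-point search algorithms; purely for illustration, the sketch below tackles the same covering model with a greedy heuristic, adding at each step the attribute that separates the largest number of still-unseparated positive-negative pairs. The function name and data layout are assumptions.

```python
import numpy as np

def greedy_support_set(pos, neg, d=1):
    """Greedy sketch for the support-set model: choose binary attributes so that
    every positive/negative pair differs in at least d selected attributes.
    pos, neg: 2-D 0/1 arrays of shape (n_pos, t) and (n_neg, t)."""
    pos, neg = np.asarray(pos), np.asarray(neg)
    # diff[p, q, i] = 1 where positive p and negative q differ in attribute i (i.e. w_i(U, V)).
    diff = (pos[:, None, :] != neg[None, :, :]).astype(int)
    need = np.full((pos.shape[0], neg.shape[0]), d)     # remaining separation required per pair
    selected = []
    while (need > 0).any():
        unsatisfied = (need > 0)
        # Score each attribute by how many still-unsatisfied pairs it separates.
        scores = (diff * unsatisfied[:, :, None]).sum(axis=(0, 1))
        if selected:
            scores[selected] = -1
        best = int(np.argmax(scores))
        if scores[best] <= 0:
            raise ValueError("cannot separate all positive/negative pairs by d attributes")
        selected.append(best)
        need = need - diff[:, :, best] * unsatisfied
    return sorted(selected)

# Toy example: attribute 0 alone separates the classes.
print(greedy_support_set([[1, 0], [1, 1]], [[0, 0], [0, 1]]))   # -> [0]
```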
An alternative approach to attribute selection is a specially designed algorithmic procedure based on evaluating the importance of the given attributes, which yields a reduced attribute set (Kuzmich and Masich, 2014).
The importance of an attribute is estimated by the frequency of its inclusion in the patterns that make up the classifier (Brauner et al., 2004). The more often an attribute appears in the resulting patterns, the more important it is. Attributes that never or rarely participate in building patterns are considered unimportant.
The algorithmic procedure for generating a reduced set of attributes consists of four stages:
The first stage involves performing classification with the entire set of attributes in order to determine the importance of each attribute.
The second stage requires the researcher to set an importance threshold against which the importance of an individual attribute is assessed.
The third stage involves sorting the attributes by importance and identifying those whose importance falls below the specified threshold.
The fourth stage excludes the attributes singled out at the third stage from further consideration. The remaining attributes form the reduced set. In this way, by applying different importance thresholds, the researcher can obtain different reduced sets of attributes, which can later be used to build patterns.
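A minimal sketch of this importance-based reduction is shown below, assuming patterns are stored as dictionaries of fixed attributes and importance is measured as the fraction of patterns containing the attribute; these representational choices are not prescribed by the paper.

```python
from collections import Counter

def reduced_attribute_set(patterns, n_attributes, importance_threshold):
    """Sketch of the four-stage reduction: the importance of an attribute is taken as the
    frequency of its occurrence in the classifier's patterns; attributes whose importance
    falls below the threshold are dropped. Patterns are dicts {attribute_index: value}."""
    counts = Counter(j for pattern in patterns for j in pattern)
    total = max(len(patterns), 1)
    importance = {j: counts.get(j, 0) / total for j in range(n_attributes)}
    kept = [j for j, imp in importance.items() if imp >= importance_threshold]
    return kept, importance

patterns = [{0: 1, 2: 0}, {0: 1}, {1: 0, 0: 0}]
print(reduced_attribute_set(patterns, n_attributes=4, importance_threshold=0.5))
# -> ([0], {0: 1.0, 1: 0.33..., 2: 0.33..., 3: 0.0})
```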
2.4 Building Patterns
The concept of patterns lies at the core of the considered approach. A positive pattern is defined as a subcube of the Boolean cube ${B_{2}^{t}}$ that intersects the set ${\Omega _{s}^{+}}$ and shares no elements with the set ${\Omega _{s}^{-}}$. A negative pattern is defined in a similar fashion. A positive a-pattern, for $a\in {\{0,1\}^{t}}$, is a pattern that contains the point a. For every point $a\in {\Omega _{s}^{+}}$, we look for the maximal a-pattern, i.e. the one covering the greatest number of points of ${\Omega _{s}^{+}}$ (Kuzmich and Masich, 2014).
The corresponding subcube is defined using binary variables ${y_{j}}$, where ${y_{j}}=1$ if the j-th coordinate is fixed at the value ${a_{j}}$ and ${y_{j}}=0$ otherwise:
$\{x\in {B_{2}^{t}}:{x_{j}}={a_{j}}\ \text{for all}\ j\ \text{such that}\ {y_{j}}=1\}.$
That is, by fixing $l={\textstyle\sum _{j=1}^{t}}{y_{j}}$ variables of the original t-dimensional cube, we obtain a $(t-l)$-dimensional subcube with ${2^{t-l}}$ points.
The condition that a positive pattern must not contain any points of ${\Omega _{s}^{-}}$ requires that for each observation $b\in {\Omega _{s}^{-}}$ the variable ${y_{j}}$ equals 1 for at least one index j with ${b_{j}}\ne {a_{j}}$:
${\textstyle\sum _{j:{b_{j}}\ne {a_{j}}}}{y_{j}}\geqslant 1\quad \text{for all}\ b\in {\Omega _{s}^{-}}.$
This constraint can be made stricter to increase error resistance, in which case the 1 on the right-hand side of the inequality is replaced with a positive integer d.
On the other hand, a positive observation $c\in {\Omega _{s}^{+}}$ belongs to the considered subcube only if ${y_{j}}=0$ for all indices j with ${c_{j}}\ne {a_{j}}$. Thus the number of positive observations covered by the a-pattern can be calculated using the following formula:
${\textstyle\sum _{c\in {\Omega _{s}^{+}}}}\ {\textstyle\prod _{j:{c_{j}}\ne {a_{j}}}}(1-{y_{j}}).$
Therefore, the task of building patterns is reduced to a conditional pseudo-Boolean optimization problem with algorithmically defined functions (Bonates et al., 2006; Hammer et al., 2004a, 2004b; Hwang and Choi, 2015):
${\textstyle\sum _{c\in {\Omega _{s}^{+}}}}\ {\textstyle\prod _{j:{c_{j}}\ne {a_{j}}}}(1-{y_{j}})\to \max ,$   (2)
${\textstyle\sum _{j:{b_{j}}\ne {a_{j}}}}{y_{j}}\geqslant 1\quad \text{for all}\ b\in {\Omega _{s}^{-}},$   (3)
where $Y=({y_{1}},\dots ,{y_{t}})\in {\{0,1\}^{t}}$.
The objective function (
2) and the constraint function (
3) in this problem are both unimodal, monotonic pseudo-Boolean functions.
The task of finding the maximal negative patterns is solved in a similar fashion.
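For illustration, the sketch below grows a positive a-pattern greedily, starting from the fully fixed subcube and releasing variables while the constraint on covering opposite-class observations holds; this greedy search only mimics the optimization model (2)–(3) and is not the boundary-point algorithm used by the authors. Setting max_negative above zero corresponds to the relaxed constraint discussed below.

```python
def grow_positive_pattern(a, positives, negatives, max_negative=0):
    """Greedy sketch of building a maximal positive a-pattern: start from the subcube with
    every coordinate fixed to a (covering only a itself) and unfix coordinates one at a time,
    always choosing the release that adds the most positive observations while keeping the
    number of covered negative observations within max_negative (0 = the strict model)."""
    t = len(a)
    fixed = set(range(t))                       # indices j with y_j = 1

    def covered(points):
        return [x for x in points if all(x[j] == a[j] for j in fixed)]

    improved = True
    while improved:
        improved = False
        best_j, best_cov = None, len(covered(positives))
        for j in sorted(fixed):
            fixed.remove(j)
            if len(covered(negatives)) <= max_negative:
                cov = len(covered(positives))
                if cov > best_cov:
                    best_j, best_cov = j, cov
            fixed.add(j)
        if best_j is not None:
            fixed.remove(best_j)
            improved = True
    pattern = {j: a[j] for j in fixed}
    return pattern, best_cov

positives = [(1, 1, 0), (1, 0, 0), (1, 1, 1)]
negatives = [(0, 1, 0), (0, 0, 1)]
print(grow_positive_pattern((1, 1, 0), positives, negatives))   # -> ({0: 1}, 3)
```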
Each identified pattern is characterized by its coverage – the number of captured observations of the corresponding class – and its degree – the number of fixed variables that determine the pattern. According to the above optimization model (2)–(3), the resulting patterns do not cover any observations of the opposite class (from the training set).
The most valuable are the patterns that demonstrate the greatest coverage. The greater the coverage, the more adequately the pattern reflects the image of the class.
A particular feature of the classification problem described above is that the database contains a large number of unmeasured values (missing data), while the measurements that have been made may be inaccurate or erroneous. It is well known that errors depend directly on measurement accuracy, which indicates how close the measurement results are to the actual values of the measured quantities. The measurement accuracy can be increased or decreased depending on the allocated resources (cost of measurement tools, spending on the measurement process, stabilization of the external environment, etc.). It must be fit for the task at hand but need not be excessive, because a further increase in accuracy may lead to unjustified financial expenditures (Boros et al., 2009).
Sets of quantitative data can contain errors in the values of quantitative attributes because of imprecise tools, imperfect measurement methods or human errors. Noise and outliers can lead to observations of different classes “overlapping” with each other and falling into the “areas” of the opposite class. Consequently, the resulting patterns have a higher degree and a much smaller coverage than they would have had without those outliers and errors, and the classifier ends up consisting of a great number of small patterns (with little coverage). This prevents one from building an effective classifier with “well-interpreted” rules that involve a small number of attributes and provide a high classification accuracy.
To make the method more error-resistant, it is recommended to relax constraint (3). This reduces the number of generated patterns and increases their coverage. The constraint of the optimization model then takes the following form (Kuzmich and Masich, 2014):
${\textstyle\sum _{b\in {\Omega _{s}^{-}}}}\ {\textstyle\prod _{j:{b_{j}}\ne {a_{j}}}}(1-{y_{j}})\leqslant D,$   (4)
where D is the number of observations of the opposite class that the pattern is allowed to cover (a non-negative integer).
The functions (2)–(4) of the resulting optimization model are defined algorithmically, i.e. they are calculated by executing a specific sequence of operations. The optimization problem is solved with algorithms based on searching for boundary points of the feasible region (Antamoshkin and Masich, 2006, 2007a, 2007b). Such algorithms were designed specifically for this class of problems and exploit the monotonic behaviour of the model's functions in the space of Boolean variables. The boundary-point algorithms are search algorithms, i.e. they do not require the functions to be defined explicitly by algebraic expressions; instead, they evaluate the functions at selected points.
According to the model (2), (4), the most preferable patterns are the ones with the maximum coverage. Consequently, the patterns built in this way have a low degree, i.e. they consist of a small number of terms and use only a fraction of the attributes. Low-degree patterns correspond to large areas in the space of attributes. This may lead to their covering some observations of the other class (not present in the training set) and thus to an increased number of incorrectly classified observations, which reduces the information content of the pattern. Therefore, to increase the information content, the authors suggest an algorithmic procedure for aggregating patterns. It is applied to each created pattern and drives the degree of the pattern to a maximum while keeping its coverage intact:
${\textstyle\sum _{j=1}^{t}}{y_{j}}\to \max \quad \text{subject to}\quad f{c^{\prime }}(Y)=fc(Y),$
where $fc(Y)$ is the value of the objective function (the coverage) of the pattern before the aggregation procedure and $f{c^{\prime }}(Y)$ is the value of the objective function of the pattern after the aggregation procedure.
Thus, the application of the pattern aggregation procedure can increase the information content of the patterns by reducing their coverage of observations from the other class, thereby driving up the accuracy of the decisions made by the classifier.
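One possible reading of the aggregation step is sketched below: additional coordinates of the baseline point are fixed as long as the positive coverage fc(Y) remains unchanged; the greedy order and the data representation are assumptions, not the authors' procedure.

```python
def aggregate_pattern(pattern, a, positives):
    """Sketch of the aggregation procedure: increase the pattern's degree by fixing additional
    coordinates of the baseline point a, as long as the positive coverage fc(Y) stays unchanged."""
    def coverage(pat):
        return sum(all(x[j] == v for j, v in pat.items()) for x in positives)

    target = coverage(pattern)                       # fc(Y) before aggregation
    current = dict(pattern)
    for j, value in enumerate(a):
        if j in current:
            continue
        trial = dict(current)
        trial[j] = value
        if coverage(trial) == target:                # fc'(Y) must equal fc(Y)
            current = trial
    return current

positives = [(1, 1, 0), (1, 1, 1)]
print(aggregate_pattern({0: 1}, a=(1, 1, 0), positives=positives))   # -> {0: 1, 1: 1}
```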
The next stage of the method is dedicated to building an adequate classifier that can classify any incoming observation, i.e. an observation that was not available when the classifier was built.
2.6 Modifications to the Method of Logical Analysis of Data
Creating patterns and building a classifier are milestone stages of the method of logical analysis of data. The implementation of these stages is what directly determines the quality of the classification results. For that reason, the design of modifications to the method is associated with developing algorithmic procedures that address these stages.
Thus, at the pattern-building stage, the suggested approach to defining the objective function of the optimization model is based on modifying the objective function (2) so as to emphasize the differences between the rules used in the classifier. This approach rests on the premise that the patterns taking part in the voting should be different; otherwise they add nothing to the classification.
According to the objective function (2), each created pattern maximizes its coverage by capturing observations typical of the corresponding class, whereas non-typical observations of the class remain uncovered, and the classifier contains no patterns that account for them. In this way we obtain a set of similar patterns for the class, which compromises the classification quality. To obtain a classifier with better-distinguished rules that cover significantly different subsets of observations, the authors suggest the following modification of the objective function (2) for identifying positive patterns:
${\textstyle\sum _{c\in {\Omega _{s}^{+}}}}{K_{c}}\ {\textstyle\prod _{j:{c_{j}}\ne {a_{j}}}}(1-{y_{j}})\to \max ,$   (5)
where
${K_{c}}$ is the weight of the positive observation
$c\in {\Omega _{s}^{+}}$, which decreases when this observation is covered, effectively lowering its participation priority in building the next pattern in favour of uncovered observations.
The objective function for the optimization model used to identify negative patterns is created in a similar fashion.
To use the optimization model with the objective function (5) for building patterns, it is necessary to specify the initial weights of all observations and the rule for changing the weights of the observations that have participated in creating the current pattern. It is recommended to set the initial weight of each observation in the training set to 1. The rule for changing the weight of an observation that has already participated in creating the current pattern is given below:
where ${K_{i}}$ and ${K_{i+1}}$ are the weights of a covered observation when the current and the next pattern are being created, respectively, and ${N_{\max }}$ is a researcher-specified parameter denoting the maximum number of patterns in the classifier that may cover one observation of the training set.
Thus, using the optimization model with the objective function (5) to build patterns, one obtains logical rules that cover significantly different subsets of observations. Those of them that yield a positive value of the objective function are then selected and included in the classifier.
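The sketch below outlines this sequential weighting scheme; because the exact weight-update formula is not reproduced here, the update rule is passed in as a parameter, and all interfaces shown are assumptions rather than the authors' implementation.

```python
def build_weighted_rule_set(positives, negatives, build_pattern, update_weight):
    """Sequential scheme around objective (5), sketched with assumed interfaces:
    build_pattern(a, positives, negatives, weights) should return a pattern (a dict
    {index: value}) maximizing the weighted coverage, and update_weight(k) implements
    the weight-decay rule (the paper's own formula, which involves N_max, is not
    reproduced here). Every covered observation has its weight K_c lowered so that
    later patterns favour still-uncovered observations; only rules with a positive
    objective value are kept."""
    def covered(pattern, x):
        return all(x[j] == v for j, v in pattern.items())

    weights = [1.0] * len(positives)
    classifier = []
    for a in positives:
        pattern = build_pattern(a, positives, negatives, weights)
        objective = sum(w for x, w in zip(positives, weights) if covered(pattern, x))
        if objective > 0:
            classifier.append(pattern)
            weights = [update_weight(w) if covered(pattern, x) else w
                       for x, w in zip(positives, weights)]
    return classifier

# Example of a possible (assumed, not the paper's) decay rule for N_max = 5:
# update_weight = lambda k: max(k - 1.0 / 5, 0.0)
```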
The next stage of the method is dedicated to building an adequate classifier that can correctly classify any incoming observation, i.e. an observation that did not take part in the creation of the classifier.
In view of the potentially large volume of the data set, the question arises of reducing the number of patterns, since in the original classifier this number equals the cardinality of the training data set, $|{\Omega ^{+}}\cup {\Omega ^{-}}|$. In short, it is necessary to build a classifier consisting of a smaller number of patterns that is still capable of classifying the same observations as the complete system of patterns.
This study offers the following algorithmic procedures for reducing the number of patterns in the original classifier:
– selecting baseline observations for building patterns (Kuzmich and Masich, 2014);
– building a classifier as a composition of informative patterns (Kuzmich and Masich, 2012).
The algorithmic procedure of selecting baseline observations for building patterns involves a series of consecutive steps. First, based on the observations of the training set, one derives centroids for each class using the k-means algorithm. According to the k-means clustering algorithm, each observation of the training set is assigned to one of the k clusters so that each cluster is represented by the centroid of the corresponding observations, and the distance from each observation to the centroid of its cluster is shorter than the distance to the centroid of any other cluster. This algorithm makes it possible to pick a set of centroids that accurately represents the distribution of observations in the training set.
The algorithm comprises the following steps described in Bagirov (
2011):
Step 1. Pick k initial centroids ${z_{1}}(1),{z_{2}}(1),\dots ,{z_{k}}(1)$. The initial centroids are selected arbitrarily, e.g. as the first k observations of the training set.
Step $\boldsymbol{l}$. At the
l-th step of the iteration, distribute the set of observations
$X=\{{x_{1}},{x_{2}},\dots ,{x_{m}}\}$ among
k clusters according to the following rule:
$x\in {T_{j}}(l)\quad \text{if}\quad \| x-{z_{j}}(l)\| <\| x-{z_{i}}(l)\| $ for every $i=1,2,\dots ,k$, $i\ne j$,
where ${T_{j}}(l)$ is the set of observations belonging to the cluster with the centroid ${z_{j}}(l)$. In case of equality, the assignment is made arbitrarily.
Step $\boldsymbol{l}\boldsymbol{+}\mathbf{1}$. Based on the results of step l, new centroids of clusters ${z_{j}}(l+1)$, $j=1,2,\dots ,k$ are derived, on the assumption that the sum of squared distances between all observations belonging to the set ${T_{j}}(l)$ and the new centroid of this cluster must be minimal.
The centroid ${z_{j}}(l+1)$ minimizing ${J_{j}}={\textstyle\sum _{x\in {T_{j}}(l)}}\| x-{z_{j}}(l+1){\| ^{2}}$, $j=1,2,\dots ,k$, is the sample mean over the set ${T_{j}}(l)$. Therefore, the new cluster centroids are defined as
${z_{j}}(l+1)=\frac{1}{{N_{j}}}{\textstyle\sum _{x\in {T_{j}}(l)}}x,\quad j=1,2,\dots ,k,$
where
${N_{j}}$ is the number of sample observations included into the set
${T_{j}}(l)$. Clearly, the choice of the k-means algorithm is due to its well-established scheme of sequential correction of the computed cluster centroids.
The equality ${z_{j}}(l+1)={z_{j}}(l)$ for $j=1,2,\dots ,k$ is the convergence condition of the algorithm; once it holds, the execution stops, and the resulting sets ${T_{j}}(l)$, $j=1,2,\dots ,k$, are the sought-for clusters. Otherwise, the last step is repeated.
This algorithm is used to partition the observations of the training set of each class into clusters. It produces a separate set of centroids for each class.
Second, one needs to add the resulting sets of centroids to the observations in the training set. Third, the centroids are used as baseline observations for building patterns.
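For completeness, a plain NumPy sketch of the k-means steps described above is given below; the function names are illustrative, and if the data have already been binarized the resulting centroids are generally fractional, so whether and how they are rounded before pattern building is an assumption left open here.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=100, seed=0):
    """Plain k-means sketch (steps 1, l, l+1 above): returns k centroids for the rows of X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # arbitrary initial centroids
    for _ in range(n_iter):
        # Step l: assign every observation to the nearest centroid.
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Step l+1: recompute each centroid as the sample mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # convergence condition
            break
        centroids = new_centroids
    return centroids

def baseline_observations(pos, neg, k_pos, k_neg):
    """Centroids of each class, to be used as baseline observations for building patterns."""
    return kmeans_centroids(pos, k_pos), kmeans_centroids(neg, k_neg)

pos_centroids, neg_centroids = baseline_observations(
    pos=[[0, 1, 1], [0, 1, 0], [1, 1, 1]], neg=[[1, 0, 0], [0, 0, 0]], k_pos=2, k_neg=1)
```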
In this way, by implementing the heuristic procedure described above, we obtain a new classifier consisting of a smaller number of patterns. The number of patterns in the classifier equals the cumulative number of centroids obtained for all classes. Clearly, the classification accuracy depends on the number of centroids per class, so one needs to conduct multiple experiments with different numbers of centroids to establish this dependence.
The procedure of selecting baseline observations for building patterns must be carried out before the classifier is created; it simplifies the creation of the classifier because far fewer patterns have to be built, but it normally degrades the classification accuracy slightly. To mitigate this shortcoming, another approach to reducing the number of patterns in the original classifier can be used: build a classifier whose number of patterns equals the cardinality of the training data set and then reduce this number of patterns while retaining a high classification accuracy. This approach is implemented through the suggested procedure of building a classifier as a composition of informative patterns, which is based on the concept of their information content.
Several criteria for measuring the information content of a pattern are offered in the discipline-specific literature. This study recommends the boosting criterion, since it adequately assesses the information content of a pattern and is fairly simple to calculate:
$H=\sqrt{p}-\sqrt{n},$   (6)
where p is the number of observations of the pattern's own class captured by the created pattern and n is the number of observations of other classes captured by the created pattern.
Initially, the classifier includes all patterns built for every observation of the training set. Consequently, as the volume of the training set grows, so does the size of the classifier's rule set. Notably, the created patterns differ in information content. Patterns covering a small number of observations are statistically unreliable: many of them make more mistakes on independent control data than on the training set. For that reason, it is recommended to include in the classifier only informative patterns, i.e. patterns whose information content exceeds a certain information threshold (${H_{0}}$) specified by the researcher. This helps to reduce the number of patterns in the classifier without compromising the classification accuracy, or with only a slight change in either direction.
Solving this problem raises the issue of choosing the information threshold. This study addresses it with the following iterative procedure. At the first step, the information threshold is set to 0 for both the positive and the negative set of patterns, which yields the original classifier consisting of the maximum possible number of patterns. At the second step, the information threshold for negative (positive) patterns is set equal to the average information content (${H_{avg}}$) over all negative (positive) patterns:
${H_{avg}}=\frac{1}{q}{\textstyle\sum _{i=1}^{q}}{H_{i}},$
where q is the number of negative (positive) patterns in the classifier and ${H_{i}}$ is the information content of the i-th negative (positive) pattern calculated using formula (6).
To obtain a new classifier consisting of patterns with greater information content, we remove from the original classifier all negative (positive) patterns whose information content is below the corresponding threshold. Having calculated the average information content of the negative and positive patterns of the current classifier, we use these values to build the next classifier, which consists of the patterns whose information content exceeds the average information content of the current classifier. Each successive classifier is built in this way, each time using the average information content of the present one. This reduces the number of patterns and increases the average information content of each successive classifier. The procedure stops as soon as the number of unclassified (uncovered) observations increases during classification, i.e. the patterns included in the current classifier fail to cover certain observations of the test sample. In this case, it is necessary either to return to the previous classifier and reset the two information thresholds to their previous values, or to change the value of only one information threshold, for negative (positive) patterns, and observe how this change affects the number of unclassified observations and the classification results in general.
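The sketch below illustrates the informative-pattern filtering, assuming the boosting criterion takes its usual form $\sqrt{p}-\sqrt{n}$ (taken here to correspond to formula (6)); the pattern representation and the stopping test are simplified assumptions.

```python
import math

def boosting_informativeness(p, n):
    """Boosting criterion of informativeness, assumed here in its usual form sqrt(p) - sqrt(n),
    where p and n are the numbers of own-class and other-class observations covered."""
    return math.sqrt(p) - math.sqrt(n)

def filter_informative(patterns_pn, threshold):
    """Keep only the patterns whose information content exceeds the threshold.
    patterns_pn: list of (pattern, p, n) triples."""
    return [(pat, p, n) for pat, p, n in patterns_pn
            if boosting_informativeness(p, n) > threshold]

def next_threshold(patterns_pn):
    """Average information content H_avg over the current set of patterns."""
    values = [boosting_informativeness(p, n) for _, p, n in patterns_pn]
    return sum(values) / len(values) if values else 0.0

# Iterative procedure sketch: start with threshold 0, then repeatedly raise it to H_avg
# of the current classifier, stopping when too many observations become uncovered
# (the stopping test depends on the test sample and is omitted here).
patterns = [("r1", 30, 0), ("r2", 3, 1), ("r3", 12, 2)]
current = filter_informative(patterns, 0.0)
print(next_threshold(current))        # H_avg to use for the next, smaller classifier
```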
Based on the designed algorithmic procedures, the authors suggest the following modifications to the method of logical analysis of data in order to improve the generalization capability of the classifier and make it more interpretable by reducing the number of rules it uses:
– using the objective function (5) and the constraint function (4) to create patterns and building the classifier exclusively from the rules that yield a positive (greater than zero) value of the objective function;
– using the algorithmic procedure for selecting baseline observations to create patterns and applying the aggregation procedure to the resulting rules;
– applying the algorithmic procedure of building a classifier as a composition of informative patterns based on the optimization model (2), (4), coupled with the aggregation procedure.
The suggested modifications to the method of logical analysis of data can help improve the quality of the classification of new observations.