1 Introduction
A similarity measure involves two objects: a reference object, which is observed, and a second object, usually called the target object, which is evaluated for how well it matches the reference (Goshtasby and Ardeshir, 2012). Hence, a similarity measure quantifies how similar the reference and target objects are; this also includes matching parts of a given object.
The choice of a specific similarity metric depends strongly on the problem to be solved and, in particular, on the goal of the measurement process: the similarity level derived from matching a target object against a reference object. The metric also depends heavily on the characteristics of the objects being compared.
Typically, we can consider the following similarity metrics: (i) mean squares, (ii) normalized correlation, (iii) pattern intensity, (iv) mutual information, and (v) self-similarity matrix.
Mean squares, or Mean Square Error (MSE), operates on the intensity values of the two objects. It is typically used as an evaluator in one-dimensional problems: it quantifies the difference between an estimator and a reference by measuring the average of the squared errors, i.e. the deviations between them, which implies that their values should lie in the same range. MSE thus measures the quality of an estimator. It is always non-negative, and values closer to zero indicate a better estimation. MSE can also be interpreted as the second moment of the error, so it incorporates both the variance of the estimator and its bias; for an unbiased estimator, the MSE equals the variance. In such cases, MSE is a good approximation of the difference between estimator and reference. Still, it shares the disadvantages of the variance when the estimator includes heavily weighted outliers: squaring the differences amplifies large errors, which can make MSE troublesome in some applications.
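For illustration, a minimal sketch of MSE between two equally sized intensity arrays (a NumPy sketch; the function name is ours):

    import numpy as np

    def mse(a: np.ndarray, b: np.ndarray) -> float:
        """Mean Square Error between two equally sized intensity arrays."""
        assert a.shape == b.shape, "estimator and reference must share the same shape"
        diff = a.astype(np.float64) - b.astype(np.float64)
        # Squaring amplifies large deviations, hence the sensitivity to outliers.
        return float(np.mean(diff ** 2))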
Normalized correlation calculates the correlation between the intensity values characterizing the two objects, thereby establishing a similarity measure between them. Both objects are represented as two vectors. This method is commonly used for template matching, which finds matching patterns in the two objects. The normalized correlation takes values in the range between −1 and +1 and captures the linear relationship between the objects; it corresponds to the two-dimensional version of Pearson's correlation coefficient. The advantages of the method are its easy implementation and computational efficiency. Its main disadvantage is that it is limited to linear relationships only.
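A corresponding sketch (names are ours; constant inputs, whose norm after centring is zero, are left unhandled for brevity):

    import numpy as np

    def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
        """Normalized (Pearson-style) correlation of two intensity arrays, in [-1, +1]."""
        a = a.astype(np.float64).ravel()
        b = b.astype(np.float64).ravel()
        a -= a.mean()  # centring makes the measure invariant to intensity offsets
        b -= b.mean()
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))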
Pattern intensity uses the differences between intensity values, transformed by a function usually given as $1/(1+x)$; the squared differences are then summed. The metric increases both when more samples are available and when the intensity values are close to each other, which is its advantage over some other metrics.
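The exact formulation varies across descriptions; the following sketch uses the common form $1/(1+{d^{2}}/{\sigma ^{2}})$ applied to the squared differences, where the scale parameter σ is our assumption:

    import numpy as np

    def pattern_intensity(a: np.ndarray, b: np.ndarray, sigma: float = 10.0) -> float:
        """Sum of squared intensity differences passed through 1/(1+x).

        Close intensity values contribute terms near 1 and large deviations
        contribute terms near 0, so the score grows both with the number of
        samples and with the closeness of the intensities.
        """
        d2 = (a.astype(np.float64) - b.astype(np.float64)) ** 2
        return float(np.sum(1.0 / (1.0 + d2 / sigma ** 2)))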
Mutual information is a metric grounded in information theory. It measures the dependence between two objects represented as variables: it quantifies how much one variable tells about the other, i.e. the amount of information obtained about one variable through the other. The concept of mutual information is closely connected to entropy; accordingly, it can be interpreted as the reduction in uncertainty about one variable given knowledge of the other. The obtained value is expressed in units such as bits (when base-2 logarithms are used). A mutual information of zero means that the two variables are independent; a high value indicates a large reduction in uncertainty, while a low value reveals a small reduction in uncertainty.
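A sketch estimating mutual information (in bits) from the joint histogram of two 8-bit images (names and the bin count are ours):

    import numpy as np

    def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
        """Mutual information (in bits) estimated from the joint histogram."""
        joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
        pxy = joint / joint.sum()            # joint probability distribution
        px = pxy.sum(axis=1, keepdims=True)  # marginal distribution of a
        py = pxy.sum(axis=0, keepdims=True)  # marginal distribution of b
        nz = pxy > 0                         # skip empty bins to avoid log(0)
        return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))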
Self-similarity matrix graphically represents similar sequences in a data series. The data series is typically a sequence of N feature vectors. A square self-similarity matrix of size $N\times N$ is then computed, showing the pairwise similarities or dissimilarities between the vectors. Basically, a self-similarity matrix compares a series of features with itself to reveal similarities and differences within the series. The diagonal elements correspond to self-comparisons and are therefore zero when a distance is used. Each off-diagonal element is calculated by applying a distance measure, such as the Euclidean distance (or some other), to the corresponding pair of features. This allows the comparison of repetitive elements and segments: more similar elements (including repetitive ones) receive higher scores, while dissimilar elements receive lower or negative scores. If similar sequences representing repetitive elements exist, diagonal stripes appear in the plot.
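A sketch of the construction, using the Euclidean distance as the element-wise measure (so the matrix holds dissimilarities and the diagonal is zero):

    import numpy as np

    def self_similarity_matrix(features: np.ndarray) -> np.ndarray:
        """N x N matrix of pairwise Euclidean distances between N feature vectors.

        `features` is an (N, d) array; entry (i, j) compares vectors i and j,
        so the diagonal (self-comparisons) is zero and repetitive subsequences
        show up as stripes parallel to the diagonal.
        """
        diff = features[:, None, :] - features[None, :, :]
        return np.sqrt(np.sum(diff ** 2, axis=-1))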
Finally, it is worth noting that the choice of the right similarity measure depends on the objects being compared.
The aforementioned measures, as well as their extensions, are typically used for objects represented as two-dimensional structures, i.e. matrices. In that sense, the similarity of two objects can be evaluated in two different settings: (i) determining the similarity between two matrices, and (ii) determining the similarity between a template matrix and a window in a larger matrix.
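As an illustration of case (ii), a sliding-window sketch that scores a template against every window of a larger matrix (names are ours; normalized correlation is used as the example measure):

    import numpy as np

    def best_window_match(template: np.ndarray, matrix: np.ndarray):
        """Top-left corner and score of the window most similar to the template."""
        th, tw = template.shape
        best_score, best_pos = -np.inf, (0, 0)
        for i in range(matrix.shape[0] - th + 1):
            for j in range(matrix.shape[1] - tw + 1):
                window = matrix[i:i + th, j:j + tw]
                # Pearson correlation between template and window (constant
                # windows yield NaN and are simply skipped by the comparison).
                score = np.corrcoef(template.ravel(), window.ravel())[0, 1]
                if score > best_score:
                    best_score, best_pos = score, (i, j)
        return best_pos, best_score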
According to the previous categorization, many similarity measures have been proposed in the literature to evaluate the matching of objects. Some of them are statistics-based: (i) Pearson Correlation Coefficient (Pearson, 1896), (ii) Stochastic Sign Change (Duda et al., 2000), (iii) Median of Square Differences (Lan et al., 1995), (iv) Intensity-Ratio Variance (Woods et al., 1992), (v) Intensity-Mapping-Ratio Variance (Hill et al., 1993), and (vi) Rank Distance (Goshtasby and Ardeshir, 2012). Other measures have geometrical properties, such as: (i) Tanimoto Measure (Theodoridis and Koutroumbas, 2009), and (ii) Normalized Square ${L_{2}}$ Norm (Evangelidis and Psarakis, 2008). Still other measures are information-theoretic, such as: (i) Incremental Sign Distance (Kaneko et al., 2002), (ii) Joint Entropy (Shannon, 1948), (iii) Kullback-Leibler Divergence (Kullback and Leibler, 1951), (iv) Exclusive F-Information (Rougon et al., 2003), (v) Jensen-Shannon Divergence (Fuglede and Topsoe, 2004), and (vi) Arithmetic-Geometric Mean Divergence (Tourassi et al., 2007).
Lately, the similarity between matrices has been measured by Common Submatrix approaches, typically on image examples, with the introduction of the baseline Average Common Submatrix (ACSM) measure (Amelio and Pizzuti, 2013, 2016) and its variant Approximate Average Common Submatrix (A-ACSM) (Amelio, 2016), which overcome the limitations of the previously introduced (dis)similarity measures. They measure the similarity of two images by computing the average area of the largest square sub-matrices matching in the two images. These methods are based on the principle that two matrices are more similar to each other if they share larger patterns. The application of the Common Submatrix approaches is predominantly linked to images; however, they can be used in any other area of signal processing where data can be represented in matrix format.
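To make the principle concrete, a brute-force sketch of this idea (our illustrative reading of the published description, not the authors' reference implementation; α is the sampling interval between anchor positions):

    import numpy as np

    def largest_match_anywhere(A, B, i, j):
        """Largest k such that the k x k block of A anchored at (i, j) occurs somewhere in B."""
        n = min(A.shape[0] - i, A.shape[1] - j)
        best = 0
        for k in range(1, n + 1):
            block = A[i:i + k, j:j + k]
            found = any(np.array_equal(block, B[r:r + k, c:c + k])
                        for r in range(B.shape[0] - k + 1)
                        for c in range(B.shape[1] - k + 1))
            if not found:
                break  # a larger block contains this one, so it cannot match either
            best = k
        return best

    def acsm(A, B, alpha=1):
        """Average area of the largest common square sub-matrices (brute force)."""
        areas = [largest_match_anywhere(A, B, i, j) ** 2
                 for i in range(0, A.shape[0], alpha)
                 for j in range(0, A.shape[1], alpha)]
        return sum(areas) / len(areas)

The nested search over all positions of the second matrix is what makes the brute-force algorithm so expensive, which motivates the indexing and approximation strategies discussed next.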
The main limitation of the baseline ACSM measure is its computational complexity, which is critical when its brute-force algorithm is applied. Although a tree indexing strategy has been introduced in ACSM for reducing the temporal cost of the brute-force algorithm, the construction procedure of the tree is quite demanding in time and space. On the other hand, the A-ACSM measure reduces the temporal cost of the brute-force algorithm of ACSM by omitting the comparison of a portion of elements at regular intervals during the match of the sub-matrices. Its main advantage is that it does not need any additional tree indexing data structure to be constructed and kept in memory. However, setting the correct size of the interval is a critical aspect which may compromise the similarity computation.
In this paper, we propose an optimized version of the Common Submatrix approaches for measuring similarity. The new measure only considers the match of those sub-matrices which are located at the same position in the two matrices. Also, the new measure matches the sub-matrices exactly, because every element of the first sub-matrix is compared with its corresponding element of the second sub-matrix; hence, no element is omitted from the comparison during the match. In this way, the time complexity of the baseline ACSM method is improved while all elements are still included in the comparison of the sub-matrices, with no need to construct a tree indexing data structure. An experiment is conducted on a subset of 1000 images from the MNIST dataset and a subset of 210 images from the NIST dataset for the retrieval of handwritten character images. The obtained results show that the new similarity measure is very promising in terms of retrieval precision and execution time when compared with other well-known (dis)similarity measures in the state of the art.
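Based on this description, a minimal sketch of the positional variant (again our illustrative reading, not the reference implementation):

    import numpy as np

    def largest_match_at(A, B, i, j):
        """Largest k such that the k x k blocks of A and B anchored at (i, j) coincide."""
        n = min(A.shape[0] - i, A.shape[1] - j)
        best = 0
        for k in range(1, n + 1):
            # Every element is compared; no element is omitted from the match.
            if np.array_equal(A[i:i + k, j:j + k], B[i:i + k, j:j + k]):
                best = k
            else:
                break  # a larger block would contain the mismatch
        return best

    def eacsm(A, B, alpha=1):
        """Average area of the largest square sub-matrices matching at the same position."""
        assert A.shape == B.shape, "this sketch assumes matrices of equal size"
        areas = [largest_match_at(A, B, i, j) ** 2
                 for i in range(0, A.shape[0], alpha)
                 for j in range(0, A.shape[1], alpha)]
        return sum(areas) / len(areas)

Compared with the ACSM sketch above, the search over all positions of the second matrix disappears, which is the source of the improvement in time complexity.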
The paper is organized as follows. Section 2 recalls the main concepts underlying the baseline ACSM measure from which the new similarity measure is derived. Section 3 presents the main features of the new similarity measure. Section 4 describes the experiment, i.e. the procedure for performance evaluation, the adopted datasets, and the competing methods. Section 5 shows the results in terms of retrieval precision and execution time. Finally, Section 6 draws a conclusion and outlines future work directions.
4 Experiment
We test the performance of eACSM on two benchmark datasets of handwritten character images. This is accomplished by evaluating the performance of a typical retrieval system where eACSM is employed as the similarity measure. Then, we compare eACSM with other well-known (dis)similarity measures previously introduced in the literature, and show the differences in terms of retrieval performance and execution time.
4.1 Performance Evaluation
Let ${I_{\mathrm{D}}}$ be an image dataset where the images are categorized in different semantic classes. Also, let Q be a query image not included in ${I_{\mathrm{D}}}$ and belonging to one of the semantic classes. The aim is to retrieve from ${I_{\mathrm{D}}}$ the top K most (least) (dis)similar images to Q which are also relevant. Images belonging to the same semantic class as Q are considered relevant. In order to evaluate the retrieval accuracy, we employ the well-known retrieval precision measure (Tourassi et al., 2007), which is defined as follows:

$P=\frac{\mathit{RI}}{K},$
where $\mathit{RI}$ is the number of relevant retrieved images and K is the number of retrieved images. In order to compute the retrieval precision, the (dis)similarity between the query image Q and each image in ${I_{\mathrm{D}}}$ is computed. Then, the images are sorted in increasing or decreasing order according to the obtained (dis)similarity values. At the end, the relevant images among the top K retrieved images are selected, and their number $\mathit{RI}$ is computed.
Because the retrieval precision is dependent on the selected query image, for a given value of K, we compute this measure for different query images. At the end, we average over the different query images and return the average retrieval precision for the evaluation. Hence, for each value of $K=1,\dots ,N$, where $N\ll |{I_{\mathrm{D}}}|$, we compute the average retrieval precision for the top K retrieved images.
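A sketch of this evaluation loop (names are ours; `dissimilarity` stands for any of the compared measures, with scores sorted in increasing order):

    import numpy as np

    def precision_at_k(query, dataset, labels, query_label, K, dissimilarity):
        """Fraction of the top-K retrieved images sharing the query's semantic class."""
        scores = [dissimilarity(query, img) for img in dataset]
        top_k = np.argsort(scores)[:K]  # most similar first for a dissimilarity measure
        relevant = sum(1 for idx in top_k if labels[idx] == query_label)  # RI
        return relevant / K

    def average_precision_at_k(queries, query_labels, dataset, labels, K, dissimilarity):
        """Average retrieval precision over the set of query images."""
        return float(np.mean([precision_at_k(q, dataset, labels, ql, K, dissimilarity)
                              for q, ql in zip(queries, query_labels)]))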
Also, we evaluate the temporal performance of the (dis)similarity measure by computing the Average Time per Query ($\mathit{ATQ}$), which is the execution time (in seconds) required for computing the (dis)similarity between a query image Q and each image in ${I_{\mathrm{D}}}$, averaged over the different query images. Basically, we introduce the $\mathit{ATQ}$ as follows:

$\mathit{ATQ}=\frac{1}{|\mathcal{Q}|}{\sum _{Q\in \mathcal{Q}}}{T_{Q}},$

where $\mathcal{Q}$ is the set of considered query images, and ${T_{Q}}$ is the total execution time required for computing the (dis)similarity between the current query image Q and all images in ${I_{\mathrm{D}}}$.
4.2 Datasets
We perform the evaluation on a randomly selected subset of images from the MNIST dataset. MNIST is a dataset containing 60,000 training and 10,000 test images of handwritten digits divided into 10 semantic classes (the digits from 0 to 9). The images are extracted from the larger NIST dataset, a benchmark for testing pattern recognition and machine learning methods in image processing and handwritten character recognition. In particular, one half of the MNIST training images comes from the NIST test images and the other half from the NIST training images; the MNIST test set has the same composition. Each image is centred and size-normalized.
From the original dataset, we randomly select 100 images for each semantic class, obtaining a subset of 1000 images with 10 semantic classes. Then, we randomly select 7 query images from the original dataset, not included in the subset of 1000 images and belonging to 4 out of the 10 semantic classes. Each image is of size $28\times 28$ pixels in bitmap format (.bmp).
Figure 6 shows sample images from the MNIST dataset.

Fig. 6. Sample handwritten digit images from the MNIST dataset: (a) number 0, (b) number 1, (c) number 6, (d) number 7.
Also, we perform a second evaluation on a randomly selected subset of images from the NIST Special 19 dataset. NIST includes binary images of 3669 handwriting sample forms, together with 814,255 images of digits and of uppercase and lowercase alphabet letters segmented from those forms, in Portable Network Graphics (.png) format. Each segmented image is $128\times 128$ pixels. All segmented images are divided into multiple categories for different recognition tasks: (i) by writer, (ii) by class, (iii) by caseless class (no distinction between uppercase and lowercase letters), and (iv) by field origin.
We extract an image subset corresponding to the alphabet letters only. Basically, we retrieve the relevant images corresponding to the class (letter of the alphabet) of the query image, independently of the writer and the field origin. Accordingly, we use the categorization of the segmented images by class. In this way, the retrieval task is more complex because of the different writing styles and field information.
From the original dataset, we randomly select a subset of 30 segmented images from 7 classes, for a total of 210 images. Also, we randomly extract a set of 5 query images not included in the subset of 210 images, each belonging to one of the 7 classes. Each image is resized to $28\times 28$ pixels and converted to bitmap (.bmp) format.
Figure 7 shows sample images from the NIST Special 19 dataset.

Fig. 7. Sample handwritten alphabet letter images from the NIST Special 19 dataset: (a) letter j, (b) letter k, (c) letter l, (d) letter m.
4.3 Competing Methods
Because the Common Submatrix approaches derive from the notion of shared information, which has roots in information theory (Ulitsky et al., 2006), we compare eACSM with four other information-theoretic dissimilarity measures usually adopted in image processing (Tourassi et al., 2007): (i) Joint entropy, (ii) Kullback-Leibler divergence, (iii) Jensen-Shannon divergence, and (iv) Arithmetic-geometric mean divergence.
The Joint entropy ($\mathit{JE}$) is the entropy of the joint histogram of two images A and B. If the two images are similar, their joint entropy is low in comparison with the sum of their individual entropies. Consequently, $\mathit{JE}$ is considered a distance measure:

$\mathit{JE}(A,B)=-{\sum _{a,b}}p(a,b)\log p(a,b),$

where $p(a,b)$ is the joint probability of the intensity values a in A and b in B, estimated from the joint histogram.
The Kullback–Leibler divergence ($\mathit{KL}$) measures the distance between the probability distribution $a(x)$ of one image and the probability distribution $b(x)$ of the second image (both approximated by the image histograms). $\mathit{KL}$ quantifies the average efficiency of adopting the histogram of one image for codifying the histogram of the second one. Consequently, the more similar the two images are, the lower the $\mathit{KL}$ value is:

$\mathit{KL}(A,B)={\sum _{x}}a(x)\log \frac{a(x)}{b(x)}.$
The Jensen–Shannon divergence ($\mathit{JD}$) is a version of $\mathit{KL}$ that is more stable to noise and to variations in histogram binning. Also, it is a symmetric measure and is bounded between 0 and 2:

$\mathit{JD}(A,B)={\sum _{x}}a(x)\log \frac{2a(x)}{a(x)+b(x)}+{\sum _{x}}b(x)\log \frac{2b(x)}{a(x)+b(x)}.$
The Arithmetic-Geometric Mean divergence ($\mathit{AGM}$) quantifies the $\mathit{KL}$ divergence between the arithmetic and geometric means of the probability distributions $a(x)$ and $b(x)$ of the two images:

$\mathit{AGM}(A,B)={\sum _{x}}\frac{a(x)+b(x)}{2}\log \frac{a(x)+b(x)}{2\sqrt{a(x)b(x)}}.$
We use the image histograms for approximating the marginal probability distributions of the single images and the joint probability distributions. Accordingly, we set the number of bins of the histograms equal to 256, which is a sufficient number for obtaining a good approximation.
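For reference, a direct transcription of the four measures into code (the small constant `eps`, which guards against empty histogram bins, is our assumption; base-2 logarithms are used throughout):

    import numpy as np

    def histogram(img, bins=256, eps=1e-12):
        """Normalized intensity histogram used as a probability distribution."""
        h, _ = np.histogram(img.ravel(), bins=bins, range=(0, 256))
        return h / h.sum() + eps  # eps avoids division by zero and log(0)

    def joint_entropy(img_a, img_b, bins=256):
        h, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
        p = (h / h.sum()).ravel()
        p = p[p > 0]  # skip empty bins
        return -np.sum(p * np.log2(p))

    def kl_divergence(a, b):
        return np.sum(a * np.log2(a / b))

    def js_divergence(a, b):
        m = (a + b) / 2
        return kl_divergence(a, m) + kl_divergence(b, m)  # symmetric, in [0, 2]

    def agm_divergence(a, b):
        m = (a + b) / 2     # arithmetic mean
        g = np.sqrt(a * b)  # geometric mean
        return np.sum(m * np.log2(m / g))

Here `a` and `b` are the histograms returned by `histogram` for the two images, while `joint_entropy` works directly on the image arrays.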
Finally, we compare the eACSM similarity measure with the baseline ACSM similarity measure and its approximate variant A-ACSM. We set the α parameter of all methods to 5, which revealed the best performance in a similar experiment (Amelio and Pizzuti, 2016). In A-ACSM, we set the row and column offset parameters ${\Delta _{\mathrm{r}}}$ and ${\Delta _{\mathrm{c}}}$ to 1 and 4, respectively, which experimentally obtained the highest accuracy (Amelio, 2016).