<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFO1130</article-id>
<article-id pub-id-type="doi">10.15388/Informatica.2017.118</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Popularity-Based Ranking for Fast Approximate kNN Search</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Antol</surname><given-names>Matej</given-names></name><email xlink:href="xantol@fi.muni.cz">xantol@fi.muni.cz</email><xref ref-type="aff" rid="j_info1130_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>M. Antol</bold> is a PhD student at the Faculty of Informatics, Masaryk University in Brno, Czech Republic. His area of interest is data analytics, namely optimization of hierarchical structures and their performance in metric spaces.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Dohnal</surname><given-names>Vlastislav</given-names></name><email xlink:href="dohnal@fi.muni.cz">dohnal@fi.muni.cz</email><xref ref-type="aff" rid="j_info1130_aff_001"/><bio>
<p><bold>V. Dohnal</bold> is an associate professor at the Faculty of Informatics, Masaryk University. His interests include database systems and information retrieval, in particular indexing principles and mechanisms for unstructured data modelled as a metric space.</p></bio>
</contrib>
<aff id="j_info1130_aff_001">Faculty of Informatics, <institution>Masaryk University</institution>, Botanicka 68a, Brno, <country>Czech Republic</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2017</year></pub-date><pub-date pub-type="epub"><day>1</day><month>1</month><year>2017</year></pub-date><volume>28</volume><issue>1</issue><fpage>1</fpage><lpage>21</lpage><history><date date-type="received"><month>10</month><year>2016</year></date><date date-type="accepted"><month>2</month><year>2017</year></date></history>
<permissions><copyright-statement>© 2017 Vilnius University</copyright-statement><copyright-year>2017</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Similarity searching has become widely available in many on-line archives of multimedia data. Users accessing such systems look for data items similar to their specific query object and typically refine results by re-running the search with a query from the results. We study this issue and propose a mechanism of approximate kNN query evaluation that incorporates statistics of accessing index data partitions. Apart from the distance between database objects, it also considers the prior query answers to prioritize index partitions containing frequently retrieved data, so evaluating repetitive similar queries more efficiently. We verify this concept in a number of experiments.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>kNN query</kwd>
<kwd>approximate search</kwd>
<kwd>query popularity</kwd>
<kwd>index structure</kwd>
<kwd>metric space</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_info1130_s_001">
<label>1</label>
<title>Introduction</title>
<p>Content-based retrieval systems are increasingly applied to complement traditional retrieval systems. For example, photo stocks provide a user with images visually similar to a given one. Users who are not satisfied with the result may browse the database by issuing a new query, e.g. by clicking on a previously returned image. As a consequence, many queries processed by the system are alike, so search algorithms may optimize such repeated queries to save computational resources.</p>
<p>In general, query efficiency is supported by various indexing structures, storage layouts and disk caching/buffering techniques, which greatly reduce the number of disk I/Os needed to answer a query. However, handling more complex and unstructured data requires extracting so-called descriptors as surrogate information used to organize and query the data. The descriptors typically form high-dimensional spaces or even distance spaces where no implicit coordinate system is defined (Samet, <xref ref-type="bibr" rid="j_info1130_ref_023">2006</xref>). The problem of the <italic>curse of dimensionality</italic> then often appears (Böhm <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_006">2001</xref>). An indexing mechanism then typically visits many data partitions due to high overlaps among them, although the useful information is obtained from only a few. Efficiency is then improved by further filtering constraints and optimized node-splitting strategies in the indexing structures (Skopal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_026">2005</xref>; Ciaccia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_010">1997</xref>) or by sacrificing precision in query results (approximate querying) (Amato <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_001">2003</xref>; Houle and Sakuma, <xref ref-type="bibr" rid="j_info1130_ref_015">2005</xref>; Houle and Nett, <xref ref-type="bibr" rid="j_info1130_ref_014">2015</xref>).</p>
<p>In this paper, we further study the issue of evaluating repeated queries and apply a general technique to prioritize data partitions during query evaluation. This technique exploits previously executed queries to gather the necessary statistics, so the so-called popularity of data partitions can be established and used to prioritize access to the partitions. We proposed this technique in Antol and Dohnal (<xref ref-type="bibr" rid="j_info1130_ref_002">2016</xref>) and applied it to precise kNN queries. Here, we elaborate on this concept further and reveal a connection between algorithm parameters and data-set properties. Our findings are confirmed by experiments on two different data-sets (CoPhIR, Batko <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_004">2009</xref> and Profiset, Budikova <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_008">2011</xref>). Additionally, we applied our concept to the evaluation of approximate kNN queries. The concept of popularity is commonly used in information retrieval for post-processing (ranking) the candidate set retrieved from an index. We instead make popularity an inherent part of the search algorithm by warping the original distances, so it is applied during the search itself.</p>
<p>The paper is structured as follows. In the next section, we summarize related work. Indexing and querying principles are concisely given in Section <xref rid="j_info1130_s_003">3</xref>. Inefficiency in performance of current indexes is presented in Section <xref rid="j_info1130_s_006">4</xref>. The application of Inverted Cache Index to approximate kNN query evaluation is described in Section <xref rid="j_info1130_s_009">5</xref> and its performance in Section <xref rid="j_info1130_s_010">6</xref>. Contributions of this paper and possible future extensions are summarized in Section <xref rid="j_info1130_s_015">7</xref>.</p>
</sec>
<sec id="j_info1130_s_002">
<label>2</label>
<title>Related Work</title>
<p>The wide area of indexing structures for metric space is surveyed in Zezula <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1130_ref_028">2005</xref>), Chávez <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1130_ref_009">2001</xref>). To process large data-sets, indexing structures are designed as disk-oriented. Their data partitioning principles are typically based on (i) hierarchical clustering (e.g. M-tree, Ciaccia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_010">1997</xref>), where each subtree is covered by a preselected data object (pivot) and a covering radius; (ii) Voronoi partitioning (e.g. M-index, Novak <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_018">2011</xref>, PPP-Codes, Novak and Zezula, <xref ref-type="bibr" rid="j_info1130_ref_019">2014</xref>), where subtrees are formed by assigning objects to the closest pivot recursively; and (iii) precomputed distances (e.g. Linear AESA, Vilar, <xref ref-type="bibr" rid="j_info1130_ref_027">1995</xref>), where no explicit structure is built, but rather distances among data objects are stored in a matrix.</p>
<p>Optimizations of query-evaluation algorithms are based on extending a hierarchical structure with additional precomputed distances to strengthen filtering capabilities, e.g. M<inline-formula id="j_info1130_ineq_001"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast }}$]]></tex-math></alternatives></inline-formula>-tree (Skopal and Hoksza, <xref ref-type="bibr" rid="j_info1130_ref_024">2007</xref>), cutting local pivots (Oliveira <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_020">2015</xref>); or on exploiting a large number of pivots in a very compact and reusable way, e.g. the permutation prefix index (Esuli, <xref ref-type="bibr" rid="j_info1130_ref_012">2012</xref>). These techniques, however, do not analyse the stored data or the accesses to data partitions, but rather constrain the data partitions as much as possible.</p>
<p>Another way to make query evaluation much cheaper is to trade accuracy, i.e. to apply approximate searching. Existing approaches use early-termination or relaxed-branching strategies to stop searching prematurely, e.g. when the query result improves only marginally. A recent approach called spatial approximation sample hierarchy (Houle and Sakuma, <xref ref-type="bibr" rid="j_info1130_ref_015">2005</xref>) builds an approximate near-neighbour graph and does not exploit the triangle inequality at all. We remind the reader that the triangle inequality, as a property of metric space, is widely used to filter out irrelevant data partitions/objects. This approach was later improved and combined with cover trees to design the Rank Cover Tree (Houle and Nett, <xref ref-type="bibr" rid="j_info1130_ref_014">2015</xref>).</p>
<p>Distance-Cache (Skopal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_025">2012</xref>) is a main-memory structure that collects information from previous querying. It caches some distances computed between metric objects to help form tighter lower and upper bounds on distances between newly arriving queries and database objects. In this respect, it is applicable to any metric indexing structure, a property it shares with our approach.</p>
<p>With the advance of content-based retrieval, there are approaches that optimize a stream of kNN queries. The Snake table (Barrios <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_003">2014</xref>) is a dynamically-built structure for optimizing all queries corresponding to one user session. It remembers the results of all queries processed so far and constructs a linear AESA to evaluate subsequent queries more efficiently. There are also approaches that intercept a collection of queries at regular intervals and group them by similarity, so that some of them can be evaluated simultaneously; this is also partly done in the Snake table. A pure caching strategy for similarity queries was proposed in Falchi <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1130_ref_013">2009</xref>). On the other hand, a combination of a query cache and an index structure was proposed in Brisaboa <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1130_ref_007">2015</xref>). Its idea lies in reusing as much of the work spent in scanning the cache as possible while traversing the index structure, which is certainly advantageous if the query is not present in the cache. In principle, the list of clusters technique (Sadit Tellez and Chávez, <xref ref-type="bibr" rid="j_info1130_ref_022">2012</xref>) is built using the query objects only, which is analogous to the Snake table.</p>
<p>The Inverted Cache Index was recently proposed in Antol and Dohnal (<xref ref-type="bibr" rid="j_info1130_ref_002">2016</xref>). It also takes previously processed queries into account, similarly to some of the approaches presented above. However, such information is exploited only to increment the so-called &#x201C;popularity&#x201D; of data partitions; no query object or answer is stored or cached. This paper builds on this idea, proposes a parameter setting tied closely to the indexed data-set, and verifies it in approximate kNN query evaluation.</p>
</sec>
<sec id="j_info1130_s_003">
<label>3</label>
<title>Indexing and Querying Metric Spaces</title>
<p>We assume the data is modelled in a metric space, so indexing techniques can exploit the properties of metric spaces to create structures able to evaluate similarity queries efficiently.</p>
<sec id="j_info1130_s_004">
<label>3.1</label>
<title>Metric Space and Similarity Queries</title>
<p>A metric space <inline-formula id="j_info1130_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="script">M</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{M}$]]></tex-math></alternatives></inline-formula> is defined as a pair <inline-formula id="j_info1130_ineq_003"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\mathcal{D},d)$]]></tex-math></alternatives></inline-formula> of a domain <inline-formula id="j_info1130_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="script">D</mml:mi></mml:math><tex-math><![CDATA[$\mathcal{D}$]]></tex-math></alternatives></inline-formula> representing data objects and a pair-wise distance function <inline-formula id="j_info1130_ineq_005"><alternatives><mml:math>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo stretchy="false">↦</mml:mo>
<mml:mi mathvariant="double-struck">R</mml:mi></mml:math><tex-math><![CDATA[$d:\mathcal{D}\times \mathcal{D}\mapsto \mathbb{R}$]]></tex-math></alternatives></inline-formula> that satisfies: 
<disp-formula id="j_info1130_eq_001">
<alternatives><mml:math display="block">
<mml:mtable columnspacing="4.0pt" equalrows="false" columnlines="none" equalcolumns="false" columnalign="left left">
<mml:mtr>
<mml:mtd class="array">
<mml:mo>∀</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>⩾</mml:mo>
<mml:mn>0</mml:mn>
<mml:mspace width="2em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:mtext>non-negativity</mml:mtext>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo>∀</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:mtext>symmetry</mml:mtext>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo>∀</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">⇔</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mspace width="2em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:mtext>identity, and</mml:mtext>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo>∀</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">z</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mspace width="2em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:mtext>triangle inequality</mml:mtext>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{array}{l@{\hskip4.0pt}l}\forall x,y\in \mathcal{D},\hspace{1em}d(x,y)\geqslant 0\hspace{2em}& \text{non-negativity},\\ {} \forall x,y\in \mathcal{D},\hspace{1em}d(x,y)=d(y,x)\hspace{2em}& \text{symmetry},\\ {} \forall x,y\in \mathcal{D},\hspace{1em}x=y\Leftrightarrow d(x,y)=0\hspace{2em}& \text{identity, and}\\ {} \forall x,y,z\in \mathcal{D},\hspace{1em}d(x,z)\leqslant d(x,y)+d(y,z)\hspace{2em}& \text{triangle inequality}.\end{array}\]]]></tex-math></alternatives>
</disp-formula> 
The distance function is used to measure similarity between two objects. The shorter the distance is, the more similar the objects are. Consequently, a similarity query can be defined. There are many types of similarity queries (Deepak and Prasad, <xref ref-type="bibr" rid="j_info1130_ref_011">2015</xref>) but the range and k-nearest neighbour queries are the most important ones. The range query <inline-formula id="j_info1130_ineq_006"><alternatives><mml:math>
<mml:mi mathvariant="italic">R</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$R(q,r)$]]></tex-math></alternatives></inline-formula> specifies all database objects within the distance of <italic>r</italic> from the query object <inline-formula id="j_info1130_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="script">D</mml:mi></mml:math><tex-math><![CDATA[$q\in \mathcal{D}$]]></tex-math></alternatives></inline-formula>. In particular, <inline-formula id="j_info1130_ineq_008"><alternatives><mml:math>
<mml:mi mathvariant="italic">R</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">o</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">o</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">o</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$R(q,r)=\{o|o\in X,\hspace{2.5pt}d(q,o)\leqslant r\}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1130_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo stretchy="false">⊂</mml:mo>
<mml:mi mathvariant="script">D</mml:mi></mml:math><tex-math><![CDATA[$X\subset \mathcal{D}$]]></tex-math></alternatives></inline-formula> is the database to search in. In this paper, we primarily focus on the k-nearest neighbour query since it is more convenient for users. The user wants to retrieve the <italic>k</italic> objects most similar to <italic>q</italic> without needing to know details about the distance function <italic>d</italic>. Formally, <inline-formula id="j_info1130_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi></mml:math><tex-math><![CDATA[$kNN(q)=A$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1130_ineq_011"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>∧</mml:mo>
<mml:mo>∀</mml:mo>
<mml:mi mathvariant="italic">o</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">o</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$|A|=k\wedge \forall o\in A,\hspace{2.5pt}p\in X-A,\hspace{2.5pt}d(q,o)\leqslant d(q,p)$]]></tex-math></alternatives></inline-formula>.</p>
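<p>To make the two query types concrete, the following sketch evaluates them by a plain linear scan over the database. This is an illustrative fragment, not one of the indexing techniques discussed later; the Euclidean distance and all function names are our own assumptions.</p>

```python
import heapq
import math

def euclidean(x, y):
    # An example metric d; any function satisfying the metric axioms works.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def range_query(db, q, r, d=euclidean):
    # R(q, r): all database objects within distance r of the query object q.
    return [o for o in db if d(q, o) <= r]

def knn_query(db, q, k, d=euclidean):
    # kNN(q): the k database objects closest to q.
    return heapq.nsmallest(k, db, key=lambda o: d(q, o))

db = [(0, 0), (1, 1), (2, 2), (5, 5)]
print(range_query(db, (0, 0), 2))  # -> [(0, 0), (1, 1)]
print(knn_query(db, (0, 0), 2))    # -> [(0, 0), (1, 1)]
```

A linear scan computes the distance to every database object; the indexing structures described next exist precisely to avoid this cost.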
</sec>
<sec id="j_info1130_s_005">
<label>3.2</label>
<title>Indexing and Query Evaluation</title>
<fig id="j_info1130_fig_001">
<label>Fig. 1</label>
<caption>
<p>Partitioning principle of M-tree.</p>
</caption>
<graphic xlink:href="info1130_g001.jpg"/>
</fig>
<p>To organize a database to answer similarity queries efficiently, many indexing structures have been proposed (Zezula <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_028">2005</xref>). Their principles are twofold: (i) recursively applied data partitioning/clustering defined by a preselected data object called a <italic>pivot</italic> and a distance threshold, and (ii) effective object filtering using lower bounds on the distance between a database object and a query object. These principles were first surveyed in Chávez <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1130_ref_009">2001</xref>).</p>
<p>In this paper, we use the traditional index M-tree (Ciaccia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_010">1997</xref>) and a recent technique, M-index (Novak <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_018">2011</xref>). Both of these structures create an internal hierarchy of nodes that partition the data space into many buckets – elementary object storages. Please refer to Figs. <xref rid="j_info1130_fig_001">1</xref> and <xref rid="j_info1130_fig_002">2</xref> for the principles of their organization. M-tree organizes data objects in compact clusters created in a bottom-up fashion, where each cluster is represented by a pair <inline-formula id="j_info1130_ineq_012"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(p,{r^{c}})$]]></tex-math></alternatives></inline-formula> – a pivot and a covering radius, i.e. the distance from the pivot to the farthest object in the cluster. On the other hand, M-index applies Voronoi-like partitioning using a predefined set of pivots in a top-down way. In this case, clusters are formed by the objects that have the cluster’s pivot as their closest one. At subsequent levels, the objects are re-clustered using the remaining pivots, i.e. excluding the pivot that formed the current cluster. Both structures store objects in buckets at the leaf nodes, as illustrated in the figures, so we use the terms <italic>leaf node</italic> and <italic>bucket</italic> interchangeably.</p>
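<p>The two partitioning principles can be sketched as follows. This is an illustrative Python fragment with hypothetical names: it shows only a single-level Voronoi assignment in the style of M-index and the computation of the M-tree pair (p, r^c), not the full recursive structures.</p>

```python
import math

def dist(x, y):
    # Euclidean distance as the metric d (for vector data).
    return math.dist(x, y)

def voronoi_partition(objects, pivots, d=dist):
    # Voronoi-like partitioning: each object joins the cluster of its
    # closest pivot (one level of the recursive M-index scheme).
    clusters = {p: [] for p in pivots}
    for o in objects:
        closest = min(pivots, key=lambda p: d(p, o))
        clusters[closest].append(o)
    return clusters

def covering_radius(pivot, cluster, d=dist):
    # The r^c of the M-tree pair (p, r^c): the distance from the pivot
    # to the farthest object in its cluster.
    return max((d(pivot, o) for o in cluster), default=0.0)

pivots = [(0, 0), (10, 0)]
objects = [(1, 0), (9, 0), (2, 1)]
clusters = voronoi_partition(objects, pivots)
# (1, 0) and (2, 1) join pivot (0, 0); (9, 0) joins pivot (10, 0)
```

The pair returned by <code>covering_radius</code> together with the pivot is exactly what the query algorithm below uses to lower-bound distances to a whole cluster.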
<fig id="j_info1130_fig_002">
<label>Fig. 2</label>
<caption>
<p>Partitioning principle of M-index.</p>
</caption>
<graphic xlink:href="info1130_g002.jpg"/>
</fig>
<p>A kNN-query evaluation algorithm constructs a <italic>priority queue</italic> of nodes to access and gradually improves a set of candidate objects, which forms the final query answer when the algorithm finishes. The priority is defined as a lower bound on the distance between the node and the query object; this way, the likelihood that a node contains relevant data objects is estimated. In detail, the algorithm starts by inserting the root node of the hierarchy. It then repeatedly pulls the head of the priority queue until the queue is empty. The algorithm terminates immediately when the pulled head’s lower bound is greater than the distance of the current <italic>k</italic>th neighbour to the query object. If the pulled element represents a leaf node, its corresponding bucket is accessed and all data objects stored there are checked against the query, updating the query’s answer. If it is a non-leaf node, all its children are inserted into the queue with their lower bounds estimated. M-tree defines the lower bound for a node <inline-formula id="j_info1130_ineq_013"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(p,{r^{c}})$]]></tex-math></alternatives></inline-formula> and a query object <italic>q</italic> as the distance <inline-formula id="j_info1130_ineq_014"><alternatives><mml:math>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$d(q,p)-{r^{c}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1130_ineq_015"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${r^{c}}$]]></tex-math></alternatives></inline-formula> is the covering radius forming a ball around the pivot <italic>p</italic>. For further details, we refer the reader to the cited papers, which describe additional node-filtering principles of M-tree as well as the more elaborate approach of M-index.</p>
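<p>The evaluation algorithm described above can be sketched as a best-first traversal. This is a simplified, hypothetical Python implementation, assuming a node hierarchy of (pivot, covering radius) pairs and only the basic M-tree lower bound d(q, p) − r^c; the additional filtering principles of the cited papers are omitted.</p>

```python
import heapq
import itertools
import math

def d(x, y):
    return math.dist(x, y)  # the metric distance function

class Node:
    # An M-tree-like node (p, r^c); leaves carry a bucket of objects.
    def __init__(self, pivot, radius, children=None, bucket=None):
        self.pivot, self.radius = pivot, radius
        self.children = children or []
        self.bucket = bucket  # None for internal nodes

def knn_search(root, q, k):
    counter = itertools.count()            # tie-breaker for heap ordering
    queue = [(0.0, next(counter), root)]   # priority = lower bound to q
    answer = []                            # max-heap of (-distance, object)
    while queue:
        bound, _, node = heapq.heappop(queue)
        # Terminate: no remaining node can improve the current kth neighbour.
        if len(answer) == k and bound > -answer[0][0]:
            break
        if node.bucket is not None:        # leaf: check all stored objects
            for o in node.bucket:
                heapq.heappush(answer, (-d(q, o), o))
                if len(answer) > k:
                    heapq.heappop(answer)  # keep only the k closest
        else:                              # internal: enqueue all children
            for c in node.children:
                lb = max(d(q, c.pivot) - c.radius, 0.0)
                heapq.heappush(queue, (lb, next(counter), c))
    return sorted((-nd, o) for nd, o in answer)  # (distance, object) pairs
```

The early-termination test on the pulled head’s lower bound is where approximation can later be introduced: relaxing it stops the search sooner at the price of possibly missing some true neighbours.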
</sec>
</sec>
<sec id="j_info1130_s_006">
<label>4</label>
<title>Effectiveness of Indexing and Approximate Searching</title>
<p>Interactivity of similarity queries is the main driving force in making content-based information retrieval widely used (Lew <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_017">2006</xref>). In the era of Big Data, near real-time execution of similarity queries over massive data collections is even more important, because it allows various analytic tasks to be implemented (Beecks <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_005">2011</xref>). In this section, we present motivating arguments based on experience with a real-life content-based retrieval system.</p>
<sec id="j_info1130_s_007">
<label>4.1</label>
<title>Query Statistics</title>
<p>We gathered statistics of querying from two demonstration applications.<xref ref-type="fn" rid="j_info1130_fn_001">1</xref><fn id="j_info1130_fn_001"><label><sup>1</sup></label>
<p><uri>http://disa.fi.muni.cz/prototype-applications/image-search/</uri>.</p></fn> The first one organizes the well-known CoPhIR data-set (Batko <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_004">2009</xref>) consisting of 100 million images using global MPEG-7 descriptors. Whereas the second application searches in the Profiset collection (Budikova <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_008">2011</xref>) that consists of 20 million high-quality images with rich and systematic annotations using descriptors from deep convolutional neural networks (Caffe descriptors) (Jia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_016">2014</xref>).</p>
<p>Figure <xref rid="j_info1130_fig_003">3</xref> shows the absolute frequencies of the individual top-1000 queries executed during the applications’ lifetime. The demo on CoPhIR was launched in Nov. 2008, while the demo on Profiset was made public in June 2015. These power-law-like distributions are attributed to the way an initial search is presented to a new website visitor. In the CoPhIR demo, users are initially provided with a similarity search for a query image randomly picked from a set of 105 preselected images, so such queries have high frequencies. In the Profiset demo, the initial page contains 30 query images randomly selected from the whole data-set, so any repetition is caused by bookmarking or sharing popular images, and there are only a few such queries. Figure <xref rid="j_info1130_fig_004">4</xref> depicts the density of distances among the top-1000 queries: the reader may observe that CoPhIR contains very similar query objects as well as distinct ones. This confirms that the users were also browsing the data collection, as described above. The phenomenon is almost negligible in Profiset, i.e. some similar queries are present, but not many. For reference, we also include the distance density of the whole data-sets, depicted as solid curves in Fig. <xref rid="j_info1130_fig_004">4</xref>. To sum up, these two query sets form different conditions for our proposal to cope with.</p>
<fig id="j_info1130_fig_003">
<label>Fig. 3</label>
<caption>
<p>Distribution of the top-1000 unique queries ordered by their number of appearances.</p>
</caption>
<graphic xlink:href="info1130_g003.jpg"/>
</fig>
<fig id="j_info1130_fig_004">
<label>Fig. 4</label>
<caption>
<p>Density of distances among top-1000 query objects (dashed curve) and overall density of the data-set (solid curve).</p>
</caption>
<graphic xlink:href="info1130_g004.jpg"/>
</fig>
</sec>
<sec id="j_info1130_s_008">
<label>4.2</label>
<title>Indexing Structure Performance</title>
<p>The major drawback of indexing structures in metric spaces is the high amount of overlap among index substructures (data partitions), aggravated by the rather imprecise estimation of lower bounds on the distances between data objects and a query object. As a result, the kNN-query evaluation algorithm often has to access a large portion of the indexing structure’s buckets to obtain a precise answer to a query.</p>
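<p>The bucket-access pattern described above follows the generic best-first kNN scheme: a priority queue of partitions ordered by their lower-bound distances is popped until the bound of the best unvisited partition exceeds the distance to the current <italic>k</italic>-th neighbour (or, for an approximate answer, until a bucket budget is exhausted). The following Python sketch is a simplified illustration of this scheme, not the actual M-tree or M-index code:</p>
<preformat><![CDATA[
```python
import heapq

def knn_best_first(query, k, partitions, dist, max_buckets=None):
    """Best-first kNN over ball partitions given as (pivot, r_c, objects).
    Buckets are visited in increasing order of the lower bound
    max(0, d(q, pivot) - r_c); the search stops once this bound exceeds
    the distance to the current k-th neighbour (precise answer), or
    after max_buckets bucket accesses (approximate answer)."""
    queue = [(max(0.0, dist(query, p) - r), i, objs)
             for i, (p, r, objs) in enumerate(partitions)]
    heapq.heapify(queue)
    result = []          # max-heap of (-distance, object)
    accessed = 0
    while queue:
        bound, _, objs = heapq.heappop(queue)
        if len(result) == k and bound >= -result[0][0]:
            break        # no unvisited bucket can improve the answer
        if max_buckets is not None and accessed >= max_buckets:
            break
        accessed += 1
        for o in objs:
            d = dist(query, o)
            if len(result) < k:
                heapq.heappush(result, (-d, o))
            elif d < -result[0][0]:
                heapq.heapreplace(result, (-d, o))
    return sorted((-nd, o) for nd, o in result)
```
]]></preformat>
<p>The loose lower bounds discussed above manifest themselves here as buckets popped and read long before the stopping condition is met.</p>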
<fig id="j_info1130_fig_005">
<label>Fig. 5</label>
<caption>
<p>Percentage of fully completed 30NN queries for an increasing number of accessed buckets on CoPhIR.</p>
</caption>
<graphic xlink:href="info1130_g005.jpg"/>
</fig>
<p>The selected representatives of indexing structures were populated with 1 million data objects from the CoPhIR data-set, and 30NN queries for the top-1000 query objects were evaluated. In Fig. <xref rid="j_info1130_fig_005">5</xref>, we present the percentage of fully completed queries while constraining the number of accessed buckets. We tested two configurations for both M-tree and M-index – the capacity of buckets was constrained to 200 and 2,000 objects to obtain bushier and more compact structures, respectively. Table <xref rid="j_info1130_tab_001">1</xref> summarizes them. To this end, M-index’s building algorithm was initialized with 128 and 512 pivots picked at random from the data-set, and the maximum depth of M-index’s internal hierarchy was limited to 8 and 6 for CoPhIR and Profiset, respectively. From the statistics, we can see that M-tree adapts to the data distribution better than M-index and does not create buckets with very low occupation, so M-tree is the more compact data structure. However, the kNN evaluation algorithm of M-index can access promising data partitions early, so its curve of fully completed queries rises more steeply.</p>
<table-wrap id="j_info1130_tab_001">
<label>Table 1</label>
<caption>
<p>Structure details of tested indexing techniques.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Indexing structure</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Data-set</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Bckt cap. (objs)</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Total bckts</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Avg bckt occup.</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Hierarchy height</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Internal node cap.</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">M-tree 200</td>
<td style="vertical-align: top; text-align: left">CoPhIR</td>
<td style="vertical-align: top; text-align: left">200</td>
<td style="vertical-align: top; text-align: left">11 571</td>
<td style="vertical-align: top; text-align: left">43.2%</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">50</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree 2000</td>
<td style="vertical-align: top; text-align: left">CoPhIR</td>
<td style="vertical-align: top; text-align: left">2 000</td>
<td style="vertical-align: top; text-align: left">1 124</td>
<td style="vertical-align: top; text-align: left">44.5%</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">100</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree 2000</td>
<td style="vertical-align: top; text-align: left">Profiset</td>
<td style="vertical-align: top; text-align: left">2 000</td>
<td style="vertical-align: top; text-align: left">2 634</td>
<td style="vertical-align: top; text-align: left">19.0%</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">100</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index 200</td>
<td style="vertical-align: top; text-align: left">CoPhIR</td>
<td style="vertical-align: top; text-align: left">200</td>
<td style="vertical-align: top; text-align: left">62 049</td>
<td style="vertical-align: top; text-align: left">8.1%</td>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left">N/D</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index 2000</td>
<td style="vertical-align: top; text-align: left">CoPhIR</td>
<td style="vertical-align: top; text-align: left">2 000</td>
<td style="vertical-align: top; text-align: left">10 943</td>
<td style="vertical-align: top; text-align: left">4.6%</td>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left">N/D</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">M-index 2000</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Profiset</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2 000</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">20 222</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2.5%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">N/D</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The other aspect of indexing structures to study is efficiency, i.e. how many data partitions are accessed to complete a 30NN query. In Fig. <xref rid="j_info1130_fig_006">6</xref>, we present the number of visits to individual buckets of M-tree and M-index after evaluating all top-1000 queries on CoPhIR and Profiset. In particular, buckets visited during precise and approximate 30NN query evaluation are marked with “O” and “X”, respectively. The curve denoted with “+” corresponds to the buckets that contained at least one data object returned in the precise answer; it thus forms the minimal requirement to evaluate the queries. The approximate evaluation was terminated after visiting 303 buckets of M-tree (27.5%) and 254 buckets of M-index (2.3%) for CoPhIR, and 1 714 buckets of M-tree (65.9%) and 779 buckets of M-index (3.8%) for the Profiset data-set. These thresholds were selected as the numbers of visited buckets at which 70% of the queries obtain their complete answer. The average precision of approximate queries is 97.5% for both M-tree and M-index. The figure reveals a very large inefficiency in navigating to buckets containing relevant data and in terminating the search once the precise answer has been found, i.e. the estimated lower bounds on the distance between the query object and a data partition are very loose. Even though approximate evaluation provides a large performance boost, there is still a large number of buckets (more than 90%) that need not be accessed at all. Once again, the advantage of M-index over M-tree lies in navigating to relevant buckets earlier.</p>
<fig id="j_info1130_fig_006">
<label>Fig. 6</label>
<caption>
<p>Absolute frequency of visiting a bucket after evaluation of all queries on the CoPhIR and Profiset data-sets. The <italic>x</italic>-axis displays buckets sorted by frequency in descending order, so different curves imply different orderings. The <italic>y</italic>-axis is in log scale.</p>
</caption>
<graphic xlink:href="info1130_g006.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_info1130_s_009">
<label>5</label>
<title>Popularity-Based Approximate Query Evaluation</title>
<p>In this section, we describe a technique for prioritizing nodes of indexing hierarchies to locate relevant data objects earlier during kNN query evaluation. The technique is based on accumulating the frequencies with which individual data partitions are accessed, and uses them to re-order the priority queue during query evaluation. We call it <italic>Inverted Cache Index</italic> (ICI). It does not record the queries processed so far, but rather the number of times a given partition/bucket (or even data object) contributed to the final results of such queries. The technique was originally proposed in Antol and Dohnal (<xref ref-type="bibr" rid="j_info1130_ref_002">2016</xref>). In this article, we further study its properties and parameter settings and, more importantly, apply it to approximate kNN queries.</p>
<p>ICI requires the underlying indexing structure to be extended with one counter per bucket. This counter accumulates the number of times the bucket contributed to the final query answer. If the indexing structure is hierarchical, counters are kept in internal “nodes” as well. In particular, we apply the variant called <italic>object ratio</italic> (Antol and Dohnal, <xref ref-type="bibr" rid="j_info1130_ref_002">2016</xref>). The ICI counter is then used to stretch/extend the original metric distance between the bucket’s or node’s representative (<italic>p</italic> – pivot) and a query (<italic>q</italic>) as follows: 
<disp-formula id="j_info1130_eq_002">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">ICI</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">pwr</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">base</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">ICI</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">base</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {d_{\mathit{ICI}}}=\frac{d(q,p)\cdot \big({\big(\frac{d(q,p)}{{d_{\mathit{norm}}}}\big)^{\mathit{pwr}}}+1\big)}{{\log _{\mathit{base}}}(\mathit{ICI}+\mathit{base})}\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1130_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">ICI</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{ICI}}}$]]></tex-math></alternatives></inline-formula> denotes the modified distance and <inline-formula id="j_info1130_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula> is the normalization distance. The parameter <italic>base</italic> is fixed to 10, and the parameter <italic>pwr</italic> is set to 2 for M-tree and 5 for M-index, as recommended in Antol and Dohnal (<xref ref-type="bibr" rid="j_info1130_ref_002">2016</xref>). The normalization distance controls the point where the “attraction” force turns into a “repulsive” force that pushes irrelevant data partitions away. It may be set to any value up to the maximum distance in the data-set, which is 10 for CoPhIR and 200 for Profiset.</p>
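<p>A direct Python transcription of Eq. (1) may clarify the interplay of the parameters; the function below is a sketch using the settings stated above (<italic>base</italic> = 10, <italic>pwr</italic> = 2 for M-tree):</p>
<preformat><![CDATA[
```python
import math

def d_ici(d_qp, ici, d_norm, base=10, pwr=2):
    """Eq. (1): the stretched distance. The (d/d_norm)^pwr + 1 factor in
    the numerator repels partitions already far from q, while the
    logarithm of the popularity counter in the denominator attracts
    partitions that frequently contributed to past answers."""
    return d_qp * ((d_qp / d_norm) ** pwr + 1) / math.log(ici + base, base)

# an unpopular bucket (ICI = 0) keeps its distance scaled only by the
# repulsion factor, since log_base(base) = 1; here ~2.08
print(d_ici(2.0, 0, 10.0))
# a popular bucket at the same metric distance ranks closer: ~1.04
print(d_ici(2.0, 90, 10.0))
```
]]></preformat>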
<p>Since approximate query evaluation allows some imprecision in the final answer, we do not use the whole answer to update the ICI counters. In particular, we focus only on a portion of the closest objects, corresponding to the expected result precision. Their number is a parameter of Algorithm <xref rid="j_info1130_fig_007">1</xref>, as is the maximum number of visited buckets.</p>
<fig id="j_info1130_fig_007">
<label>Algorithm 1</label>
<caption>
<p>Approximate kNN query evaluation with ICI.</p>
</caption>
<graphic xlink:href="info1130_g007.jpg"/>
</fig>
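<p>The behaviour of Algorithm 1 can be sketched as follows; the bucket representation and the helper names are ours, chosen for illustration only, and the stretched distance <italic>d</italic><sub><italic>ICI</italic></sub> is passed in as a function:</p>
<preformat><![CDATA[
```python
def approx_knn_ici(q, k, buckets, dist, d_ici, max_buckets, portion=0.8):
    """Sketch of Algorithm 1. Each bucket is a dict with keys 'pivot',
    'ici' (popularity counter) and 'objects'. Buckets are ranked by the
    ICI-stretched distance, at most max_buckets of them are read, and
    afterwards only the buckets holding the `portion` closest objects
    of the answer get their ICI counters incremented."""
    ranked = sorted(buckets, key=lambda b: d_ici(dist(q, b['pivot']), b['ici']))
    answer = []                      # (distance, bucket index, object)
    for i, b in enumerate(ranked[:max_buckets]):
        for o in b['objects']:
            answer.append((dist(q, o), i, o))
    answer.sort()
    answer = answer[:k]
    # reward only the buckets contributing the closest portion of the result
    for _, i, _ in answer[:max(1, int(portion * len(answer)))]:
        ranked[i]['ici'] += 1
    return [(d, o) for d, _, o in answer]
```
]]></preformat>
<p>Over many queries, the counters of buckets that repeatedly contribute answers grow, so subsequent evaluations rank them earlier even when their lower-bound distances are loose.</p>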
</sec>
<sec id="j_info1130_s_010">
<label>6</label>
<title>Experiments</title>
<p>In this section, we provide an extensive experimental comparison of ICI-optimized precise and approximate kNN query evaluation with the standard (non-ICI) algorithms. Two different data-sets are used throughout the experiments. First, a 1-million-object subset of the CoPhIR data-set is used. Each object is formed by five MPEG-7 global descriptors (a 282-dimensional vector in total), and the distance function is a weighted sum of the <inline-formula id="j_info1130_ineq_018"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1130_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${L_{2}}$]]></tex-math></alternatives></inline-formula> metrics (Batko <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_004">2009</xref>). Second, a 1-million-object subset of the Profimedia data-set is picked (Budikova <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_008">2011</xref>). Here, Caffe descriptors (4 096-dimensional vectors) are used (Jia <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_016">2014</xref>). The metric is the Euclidean distance.</p>
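<p>For illustration, a weighted multi-descriptor distance of the kind used for CoPhIR can be sketched as below; the concrete weights and the per-descriptor assignment of <italic>L</italic><sub>1</sub> vs. <italic>L</italic><sub>2</sub> in the example are placeholders, not the actual values from Batko <italic>et al.</italic> (2009):</p>
<preformat><![CDATA[
```python
import math

def l1(x, y):
    # city-block distance over one descriptor vector
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    # Euclidean distance over one descriptor vector
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_descriptor_distance(xs, ys, weights, metrics):
    """Weighted sum of per-descriptor distances; xs and ys are lists of
    descriptor vectors (one per MPEG-7 descriptor). Weights and the
    per-descriptor metric choice are illustrative only."""
    return sum(w * m(dx, dy)
               for w, m, dx, dy in zip(weights, metrics, xs, ys))
```
]]></preformat>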
<sec id="j_info1130_s_011">
<label>6.1</label>
<title>Different Query Ordering Strategies</title>
<p>The first group of experiments focuses on determining the best setting of <inline-formula id="j_info1130_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">ICI</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{ICI}}}$]]></tex-math></alternatives></inline-formula> distance measure. We used only the M-tree, with leaf-node capacity fixed to 200 and the other parameters fixed to a logarithm base of 10 and a power of 2. We studied the progress of fully completed queries at particular numbers of accessed nodes (buckets). The results are depicted in Fig. <xref rid="j_info1130_fig_008">7</xref>, where the following approaches were compared: 
<def-list><def-item><term><bold>original</bold></term><def>
<p>– M-tree’s algorithm for precise kNN evaluation (search queue ordered by lower-bound distance <inline-formula id="j_info1130_ineq_021"><alternatives><mml:math>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">pivot</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">covering</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$=(d(q,\mathit{pivot})-{r_{\mathit{covering}}})$]]></tex-math></alternatives></inline-formula>;</p></def></def-item><def-item><term><bold>qdg</bold></term><def>
<p>– the proposed ICI-ordered queue, where ICI counter is updated for unique queries only;</p></def></def-item><def-item><term><bold>qdg-freq</bold></term><def>
<p>– the same as “qdg”, but incrementing ICI counters for all processed queries (including repeated queries).</p></def></def-item></def-list> The results show that the concept of ICI is valid: the percentage of completed queries rises faster than with the original queue ordering based on lower bounds on distance. Hence, an ordering that exploits the real distance between the query object and a pivot (node’s representative) describes the distribution of data objects in a node better, which is in compliance with the angle property for node filtering (Pramanik <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_021">1999</xref>). The best results are exhibited by the ICI strategy whose counters are advanced for every executed query, i.e. including repeated queries. We examine this strategy thoroughly in the following sections.</p>
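<p>The difference between the “qdg” and “qdg-freq” strategies reduces to the moment when an ICI counter is advanced, which the following sketch (our naming) captures:</p>
<preformat><![CDATA[
```python
def update_ici(counters, contributing_buckets, query_id, seen_queries,
               unique_only=False):
    """'qdg' (unique_only=True) advances counters only the first time a
    query is seen; 'qdg-freq' (unique_only=False) counts every
    execution, so buckets serving popular repeated queries accumulate
    proportionally more weight."""
    if unique_only:
        if query_id in seen_queries:
            return               # repeated query: no update under 'qdg'
        seen_queries.add(query_id)
    for b in contributing_buckets:
        counters[b] = counters.get(b, 0) + 1
```
]]></preformat>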
<fig id="j_info1130_fig_008">
<label>Fig. 7</label>
<caption>
<p>Percentage of fully completed queries for standard M-tree’s priority queue ordering and ICI-based ordering.</p>
</caption>
<graphic xlink:href="info1130_g008.jpg"/>
</fig>
</sec>
<sec id="j_info1130_s_012">
<label>6.2</label>
<title>Precise kNN Evaluation</title>
<p>The second group of experiments focuses on precise kNN evaluation and the impact of ICI on M-tree’s and M-index’s algorithms. We tested M-tree and M-index with the bucket capacity set to 2,000 on both data collections. The test protocol takes all queries processed by a demo application over a longer period of time and separates them into training (adapting) and testing sub-sets.</p>
<p>For CoPhIR, we used the traffic in the year 2009 for counting the popularity, and the queries processed in January 2010 for performance verification. They consisted of 993 and 1,000 query objects, respectively. About 10% of the queries are common to both sub-sets; the remaining 90% are unique. For Profiset, the training set contained the traffic from June 2015 to February 2016, whereas the testing one consisted of the 6-week traffic following the training period. All tests evaluated precise 30NN queries.</p>
<p>Figure <xref rid="j_info1130_fig_009">8</xref> depicts the total percentage of completed queries as the evaluation progresses in terms of accessed buckets. We can observe that ICI optimizes the performance of M-tree substantially, because M-tree visits a large portion of all buckets (up to 80%) to complete all queries. M-index is much more efficient in the lookup of relevant data partitions (about 30% of all buckets), and ICI-based ordering becomes better once at least 85% of queries get completed on CoPhIR. Tables <xref rid="j_info1130_tab_002">2</xref> and <xref rid="j_info1130_tab_003">3</xref> list the exact figures. We can conclude that ICI-optimized queue ordering achieves a performance gain of more than 20% at 95% of queries completed.</p>
<table-wrap id="j_info1130_tab_002">
<label>Table 2</label>
<caption>
<p>Improvement in query costs of precise 30NN for CoPhIR.</p>
</caption>
<table>
<thead>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">CoPhIR</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">50% queries completed</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">95% queries completed</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Indexing structure</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Log-pwr setup</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Orig. nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total improvement</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Orig. nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">M-tree</td>
<td style="vertical-align: top; text-align: left">10-2</td>
<td style="vertical-align: top; text-align: left">201</td>
<td style="vertical-align: top; text-align: left">153</td>
<td style="vertical-align: top; text-align: left">23.9%</td>
<td style="vertical-align: top; text-align: left">588</td>
<td style="vertical-align: top; text-align: left">495</td>
<td style="vertical-align: top; text-align: left">15.8%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree</td>
<td style="vertical-align: top; text-align: left">10-5</td>
<td style="vertical-align: top; text-align: left">201</td>
<td style="vertical-align: top; text-align: left">149</td>
<td style="vertical-align: top; text-align: left">25.9%</td>
<td style="vertical-align: top; text-align: left">588</td>
<td style="vertical-align: top; text-align: left">476</td>
<td style="vertical-align: top; text-align: left">19%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index</td>
<td style="vertical-align: top; text-align: left">10-2</td>
<td style="vertical-align: top; text-align: left">89</td>
<td style="vertical-align: top; text-align: left">125</td>
<td style="vertical-align: top; text-align: left">-40.45%</td>
<td style="vertical-align: top; text-align: left">1495</td>
<td style="vertical-align: top; text-align: left">1009</td>
<td style="vertical-align: top; text-align: left">32.5%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">M-index</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">10-5</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">89</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">130</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">-20.2%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1495</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">948</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">36.6%</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1130_tab_003">
<label>Table 3</label>
<caption>
<p>Improvement in query costs of precise 30NN for Profiset.</p>
</caption>
<table>
<thead>
<tr>
<td colspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Profiset</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">50% queries completed</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">95% queries completed</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Indexing structure</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Log-pwr setup</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Orig. nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total improvement</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Orig. nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Nodes needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Total improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">M-tree</td>
<td style="vertical-align: top; text-align: left">10-2</td>
<td style="vertical-align: top; text-align: left">1521</td>
<td style="vertical-align: top; text-align: left">339</td>
<td style="vertical-align: top; text-align: left">77.7%</td>
<td style="vertical-align: top; text-align: left">2145</td>
<td style="vertical-align: top; text-align: left">1401</td>
<td style="vertical-align: top; text-align: left">34.7%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree</td>
<td style="vertical-align: top; text-align: left">10-5</td>
<td style="vertical-align: top; text-align: left">1521</td>
<td style="vertical-align: top; text-align: left">416</td>
<td style="vertical-align: top; text-align: left">72.6%</td>
<td style="vertical-align: top; text-align: left">2145</td>
<td style="vertical-align: top; text-align: left">1652</td>
<td style="vertical-align: top; text-align: left">23%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index</td>
<td style="vertical-align: top; text-align: left">10-2</td>
<td style="vertical-align: top; text-align: left">349</td>
<td style="vertical-align: top; text-align: left">275</td>
<td style="vertical-align: top; text-align: left">21.2%</td>
<td style="vertical-align: top; text-align: left">3346</td>
<td style="vertical-align: top; text-align: left">2460</td>
<td style="vertical-align: top; text-align: left">26.5%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">M-index</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">10-5</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">349</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">253</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">27.5%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3346</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2656</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">20.6%</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="j_info1130_fig_009">
<label>Fig. 8</label>
<caption>
<p>Influence of ICI on precise 30NN query evaluation.</p>
</caption>
<graphic xlink:href="info1130_g009.jpg"/>
</fig>
</sec>
<sec id="j_info1130_s_013">
<label>6.3</label>
<title>Evolution of Performance on Monthly Basis</title>
<p>In the previous trials, the period of gathering ICI statistics was long, and we focused only on the viability of the approach. Here, we study the performance while sliding the training and testing periods month by month. Since the Profiset demo application was set up only recently, we conduct this experiment on the CoPhIR data-set only. Precise kNN queries are tested again, so the consistency of ICI’s impact can be compared across a series of consecutive time frames. This also corresponds more closely to possible future applications, where the statistics would be rolled over periodically.</p>
<p>For every run of the experiments depicted in Fig. <xref rid="j_info1130_fig_010">9</xref>, three-month traffic was used as the training data-set. The testing phase on the following month’s traffic was executed with two versions of the query evaluation algorithm: the original precise kNN and precise kNN with ICI-based ordering. The graphs in the figure depict the results for the thresholds of 50, 80 and 95% of queries completed, i.e. the average number of buckets needed to answer 50, 80 and 95% of the queries completely. The experiments were repeated over 13 consecutive time frames, covering more than a year of traffic. Every two consecutive months shared only 7% of queries on average. Table <xref rid="j_info1130_tab_004">4</xref> gives the numbers of training and testing queries in the individual time periods and their overlap.</p>
<p>The conducted experiments show a fairly uniform improvement achieved by ICI over the original precise kNN algorithm. All experiments on M-tree display an improvement in query answering greater than 10%. The average improvement at the threshold of 95% completed queries is 19.4%. For M-index, similar results are obtained at the same threshold: an average improvement of 19.2%. M-index’s original kNN algorithm performs better for highly concentrated queries (the answer is distributed over few buckets) – M-index completed 50% of queries by visiting at most 89 buckets, which is less than 1% of all buckets. The remaining queries, however, can be optimized by ICI ordering, with an overall improvement of around 10–15%.</p>
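The month-by-month protocol above can be sketched as follows (a minimal illustration only; the function name <monospace>rolling_windows</monospace> and the list-of-months representation are our assumptions, not part of the evaluation framework):

```python
def rolling_windows(months, train_len=3):
    """Slide a training/testing split month by month: each test month is
    preceded by a train_len-month training window, mirroring the 13
    consecutive CoPhIR time frames evaluated in Fig. 9."""
    for i in range(train_len, len(months)):
        # Training window = the train_len months preceding the test month.
        yield months[i - train_len:i], months[i]
```

With 16 months of traffic logs and a three-month training window, this yields exactly 13 training/testing frames, as in the reported experiments.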
<table-wrap id="j_info1130_tab_004">
<label>Table 4</label>
<caption>
<p>Numbers of queries in every training and testing period, with the percentage of queries shared between them.</p>
</caption>
<table>
<tbody>
<tr>
<td colspan="12" style="vertical-align: top; text-align: left; border-bottom: solid thin"><italic>Testing time span</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Jan.</td>
<td style="vertical-align: top; text-align: left">Feb.</td>
<td style="vertical-align: top; text-align: left">Mar.</td>
<td style="vertical-align: top; text-align: left">Apr.</td>
<td style="vertical-align: top; text-align: left">May</td>
<td style="vertical-align: top; text-align: left">Jun.</td>
<td style="vertical-align: top; text-align: left">Jul.</td>
<td style="vertical-align: top; text-align: left">Aug.</td>
<td style="vertical-align: top; text-align: left">Sep.</td>
<td style="vertical-align: top; text-align: left">Oct.</td>
<td style="vertical-align: top; text-align: left">Nov.</td>
<td style="vertical-align: top; text-align: left">Dec.</td>
</tr>
<tr>
<td colspan="12" style="vertical-align: top; text-align: left"><italic>Number of testing queries</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">12,144</td>
<td style="vertical-align: top; text-align: left">8,279</td>
<td style="vertical-align: top; text-align: left">12,887</td>
<td style="vertical-align: top; text-align: left">7,392</td>
<td style="vertical-align: top; text-align: left">6,173</td>
<td style="vertical-align: top; text-align: left">4,592</td>
<td style="vertical-align: top; text-align: left">2,359</td>
<td style="vertical-align: top; text-align: left">4,346</td>
<td style="vertical-align: top; text-align: left">4,423</td>
<td style="vertical-align: top; text-align: left">4,444</td>
<td style="vertical-align: top; text-align: left">3,065</td>
<td style="vertical-align: top; text-align: left">8,520</td>
</tr>
<tr>
<td colspan="12" style="vertical-align: top; text-align: left"><italic>Number of training queries</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">62,994</td>
<td style="vertical-align: top; text-align: left">75,138</td>
<td style="vertical-align: top; text-align: left">80,650</td>
<td style="vertical-align: top; text-align: left">33,310</td>
<td style="vertical-align: top; text-align: left">28,558</td>
<td style="vertical-align: top; text-align: left">26,452</td>
<td style="vertical-align: top; text-align: left">18,157</td>
<td style="vertical-align: top; text-align: left">13,124</td>
<td style="vertical-align: top; text-align: left">11,297</td>
<td style="vertical-align: top; text-align: left">11,128</td>
<td style="vertical-align: top; text-align: left">13,213</td>
<td style="vertical-align: top; text-align: left">11,932</td>
</tr>
<tr>
<td colspan="12" style="vertical-align: top; text-align: left"><italic>Percentage of testing queries contained in the training ones</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">26.68</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">18.28</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">14.57</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">14.09</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">9.32</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">12.46</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">16.08</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7.21</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">17.32</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7.52</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8.67</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">10.12</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="j_info1130_fig_010">
<label>Fig. 9</label>
<caption>
<p>Results of experiments on 13 consecutive months on CoPhIR. The green stripe emphasizes the improvement of ICI.</p>
</caption>
<graphic xlink:href="info1130_g010.jpg"/>
</fig>
</sec>
<sec id="j_info1130_s_014">
<label>6.4</label>
<title>Approximate kNN Evaluation</title>
<p>One of the challenges associated with the use of ICI is tuning its parameters, as presented in Eq. (<xref rid="j_info1130_eq_002">1</xref>) in Section <xref rid="j_info1130_s_009">5</xref>, namely: <inline-formula id="j_info1130_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1130_ineq_023"><alternatives><mml:math>
<mml:mi mathvariant="italic">pwr</mml:mi></mml:math><tex-math><![CDATA[$\mathit{pwr}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1130_ineq_024"><alternatives><mml:math>
<mml:mi mathvariant="italic">base</mml:mi></mml:math><tex-math><![CDATA[$\mathit{base}$]]></tex-math></alternatives></inline-formula>.</p>
<p>To make our method widely applicable, the experiments in this section use parameter values derived from statistics of the queries, indexing structures and data-sets. As already mentioned, the parameter <inline-formula id="j_info1130_ineq_025"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula> plays the role of a vanishing point, beyond which two data items become irrelevant, i.e. dissimilar. Here, we set its value to the average distance between a pair of objects in the data-set: 2.526 for CoPhIR and 85.84 for Profiset.</p>
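On a large collection, the average pairwise distance that defines this parameter can be estimated by sampling random pairs instead of enumerating all of them. A minimal sketch (the function name <monospace>estimate_d_norm</monospace> and the sampling approach are our assumptions; the text only specifies that the parameter equals the average pairwise distance):

```python
import math
import random

def estimate_d_norm(data, distance, sample_pairs=20000, seed=42):
    """Estimate d_norm as the mean distance over randomly sampled object
    pairs; sampling avoids the quadratic cost of an exact average over
    all pairs on large collections such as CoPhIR."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sample_pairs):
        # Draw two distinct object indices and accumulate their distance.
        i, j = rng.sample(range(len(data)), 2)
        total += distance(data[i], data[j])
    return total / sample_pairs
```

For a Euclidean data-set, <monospace>math.dist</monospace> can serve as the distance function; the estimate converges quickly to the exact mean pairwise distance.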
<p>The parameter <inline-formula id="j_info1130_ineq_026"><alternatives><mml:math>
<mml:mi mathvariant="italic">pwr</mml:mi></mml:math><tex-math><![CDATA[$\mathit{pwr}$]]></tex-math></alternatives></inline-formula> specifies the relative impact of ICI in comparison with the original distance. Its effect must be interpreted in two ways: (i) if a data object or partition is closer than <inline-formula id="j_info1130_ineq_027"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula>, the higher the value of <inline-formula id="j_info1130_ineq_028"><alternatives><mml:math>
<mml:mi mathvariant="italic">pwr</mml:mi></mml:math><tex-math><![CDATA[$\mathit{pwr}$]]></tex-math></alternatives></inline-formula> is, the <italic>greater</italic> the influence of ICI is; (ii) if a data object is further than <inline-formula id="j_info1130_ineq_029"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula>, the higher the value of <inline-formula id="j_info1130_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">pwr</mml:mi></mml:math><tex-math><![CDATA[$\mathit{pwr}$]]></tex-math></alternatives></inline-formula> is, the <italic>smaller</italic> the influence of ICI is. The strength of <inline-formula id="j_info1130_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="italic">pwr</mml:mi></mml:math><tex-math><![CDATA[$\mathit{pwr}$]]></tex-math></alternatives></inline-formula> can be correlated with the data-set dimensionality – whether to put more emphasis on ICI or on the distance is deduced from the intrinsic dimensionality (iDim) of the data (Chávez <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1130_ref_009">2001</xref>). For the CoPhIR and Profiset data-sets, the value of iDim is 15 and 27, respectively.</p>
<fig id="j_info1130_fig_011">
<label>Fig. 10</label>
<caption>
<p>Approximate kNN for varying limit on accessed buckets.</p>
</caption>
<graphic xlink:href="info1130_g011.jpg"/>
</fig>
<p>Similarly to <inline-formula id="j_info1130_ineq_032"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">norm</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${d_{\mathit{norm}}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1130_ineq_033"><alternatives><mml:math>
<mml:mi mathvariant="italic">base</mml:mi></mml:math><tex-math><![CDATA[$\mathit{base}$]]></tex-math></alternatives></inline-formula> serves mainly a normalization purpose: to logarithmically damp the impact of nodes and objects with overly high values of ICI. As the value of ICI corresponds to the number of queries processed, we deduce the value of <inline-formula id="j_info1130_ineq_034"><alternatives><mml:math>
<mml:mi mathvariant="italic">base</mml:mi></mml:math><tex-math><![CDATA[$\mathit{base}$]]></tex-math></alternatives></inline-formula> from the average number of queries (or the average value of ICI) per bucket. Due to the behaviour of the logarithmic function, some values are impractical or disallowed, so we constrain it to <inline-formula id="j_info1130_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">base</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>;</mml:mo>
<mml:mn>15</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$\mathit{base}\in [5;15]$]]></tex-math></alternatives></inline-formula>. The values of <inline-formula id="j_info1130_ineq_036"><alternatives><mml:math>
<mml:mi mathvariant="italic">base</mml:mi></mml:math><tex-math><![CDATA[$\mathit{base}$]]></tex-math></alternatives></inline-formula> used here are:</p>
<list>
<list-item id="j_info1130_li_001">
<label>•</label>
<p>15 for M-tree 2000 on CoPhIR (130 000 queries/1124 buckets = 115);</p>
</list-item>
<list-item id="j_info1130_li_002">
<label>•</label>
<p>12 for M-index 2000 on CoPhIR (130 000 queries/10 943 buckets = 12);</p>
</list-item>
<list-item id="j_info1130_li_003">
<label>•</label>
<p>5 for M-tree 2000 on Profiset (4059 queries/2634 buckets = 1.54);</p>
</list-item>
<list-item id="j_info1130_li_004">
<label>•</label>
<p>5 for M-index 2000 on Profiset (4059 queries/20 222 buckets = 0.20).</p>
</list-item>
</list>
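The derivation of the base from the four configurations above can be captured in a short helper (a sketch; <monospace>choose_base</monospace> is our name, and rounding the per-bucket average to the nearest integer before clamping is our assumption):

```python
def choose_base(num_queries, num_buckets, low=5, high=15):
    """Set the ICI logarithm base to the average number of training
    queries per bucket, clamped to [5, 15] so that the logarithmic
    normalization stays well behaved."""
    avg_ici_per_bucket = num_queries / num_buckets
    # Round to the nearest integer, then clamp into [low, high].
    return max(low, min(high, round(avg_ici_per_bucket)))
```

Applied to the four listed configurations, this rule reproduces the chosen values 15, 12, 5 and 5.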
<p>To verify this setting, we conducted a group of experiments on CoPhIR and Profiset for approximate kNN query evaluation. The results are presented in Fig. <xref rid="j_info1130_fig_011">10</xref>. We picked nine thresholds on the number of accessed buckets that correspond to what the index structures require to complete 10, 20, 30, 50, 60, 70, 80, 90 and 95% of queries precisely. The exact limits on the number of accessed buckets are given as labels on the <italic>x</italic>-axis.</p>
<p>The results obtained can be interpreted in two ways.</p>
<p>First, the difference between the approaches can be read horizontally, i.e. by comparing the number of accessed buckets at the same precision. It indicates the possibility of lowering the approximation threshold, which saves time when evaluating a kNN query. Table <xref rid="j_info1130_tab_005">5</xref> gives approximate results of such an optimization for kNN precisions of 90% and 95%. We can see that ICI is very competitive for all setups.</p>
<p>Second, there is the vertical comparison of results, i.e. the change in precision at the same number of accessed buckets. Table <xref rid="j_info1130_tab_006">6</xref> summarizes the precision values obtained for both variants when the original approximation algorithm first exceeded 90% and 95% precision, respectively. This demonstrates the improvement in precision when ICI is applied.</p>
<p>Finally, the reader should bear in mind that we increment ICI counters only for the first few objects in the answer when queries are evaluated approximately. In Algorithm <xref rid="j_info1130_fig_007">1</xref>, ICI is increased for the buckets, and their parents, covering the portion of the answer corresponding to the expected precision. The smaller the required precision is, the fewer buckets get their ICI increased. For example, an estimated precision of 90% on 30NN leads to updating the buckets containing the first 27 objects. This certainly requires more analysis, which we plan as future work.</p>
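The partial counter update can be sketched as follows (a simplified view: the full algorithm also propagates the increment to parent nodes, which this fragment omits; the names are ours):

```python
import math

def buckets_to_update(answer_buckets, k, expected_precision):
    """Return the buckets whose ICI counter should be incremented for an
    approximate kNN answer: only those holding the first
    ceil(k * expected_precision) result objects. answer_buckets[i] is
    the bucket that stored the i-th ranked result object."""
    cutoff = math.ceil(k * expected_precision)
    return set(answer_buckets[:cutoff])
```

For 90% precision on a 30NN query, only the buckets containing the first 27 results are updated, matching the example above.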
<table-wrap id="j_info1130_tab_005">
<label>Table 5</label>
<caption>
<p>Improvement in query costs for kNN precision 90% and 95%.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Structure, data-set</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">First precision ⩾ 90%</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">First precision ⩾ 95%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Original buckets needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ICI buckets needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Speedup</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Original buckets needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ICI buckets needed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Speedup</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">M-tree, CoPhIR</td>
<td style="vertical-align: top; text-align: left">141</td>
<td style="vertical-align: top; text-align: left">107</td>
<td style="vertical-align: top; text-align: left">24%</td>
<td style="vertical-align: top; text-align: left">250</td>
<td style="vertical-align: top; text-align: left">170</td>
<td style="vertical-align: top; text-align: left">32%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index, CoPhIR</td>
<td style="vertical-align: top; text-align: left">89</td>
<td style="vertical-align: top; text-align: left">80</td>
<td style="vertical-align: top; text-align: left">10%</td>
<td style="vertical-align: top; text-align: left">180</td>
<td style="vertical-align: top; text-align: left">140</td>
<td style="vertical-align: top; text-align: left">22%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree, Profiset</td>
<td style="vertical-align: top; text-align: left">1200</td>
<td style="vertical-align: top; text-align: left">&lt;800</td>
<td style="vertical-align: top; text-align: left">&gt;30%</td>
<td style="vertical-align: top; text-align: left">1400</td>
<td style="vertical-align: top; text-align: left">&lt;800</td>
<td style="vertical-align: top; text-align: left">&gt;40%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">M-index, Profiset</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">200</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">160</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">20%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">420</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">350</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">17%</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1130_tab_006">
<label>Table 6</label>
<caption>
<p>Improvement in precision for different numbers of visited buckets.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Structure, data-set</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">First threshold ⩾ 90%</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">First threshold ⩾ 95%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Original precision</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ICI precision</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Improvement</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Original precision</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">ICI precision</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">M-tree, CoPhIR</td>
<td style="vertical-align: top; text-align: left">90.3%</td>
<td style="vertical-align: top; text-align: left">93.8%</td>
<td style="vertical-align: top; text-align: left">3.5%</td>
<td style="vertical-align: top; text-align: left">97.5%</td>
<td style="vertical-align: top; text-align: left">98.7%</td>
<td style="vertical-align: top; text-align: left">1.2%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-index, CoPhIR</td>
<td style="vertical-align: top; text-align: left">90.7%</td>
<td style="vertical-align: top; text-align: left">92.6%</td>
<td style="vertical-align: top; text-align: left">1.9%</td>
<td style="vertical-align: top; text-align: left">97.5%</td>
<td style="vertical-align: top; text-align: left">98.0%</td>
<td style="vertical-align: top; text-align: left">0.5%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">M-tree, Profiset</td>
<td style="vertical-align: top; text-align: left">93.1%</td>
<td style="vertical-align: top; text-align: left">99.8%</td>
<td style="vertical-align: top; text-align: left">6.7%</td>
<td style="vertical-align: top; text-align: left">96.8%</td>
<td style="vertical-align: top; text-align: left">99.9%</td>
<td style="vertical-align: top; text-align: left">3.1%</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">M-index, Profiset</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">94.1%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">95.4%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1.3%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">96.0%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">97.0%</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1.0%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="j_info1130_s_015">
<label>7</label>
<title>Conclusion and Future Work</title>
<p>Building on the proposal called Inverted Cache Index (ICI), we presented a further analysis of its parameters and additional experimental trials on the CoPhIR and Profiset data collections. The results for precise kNN evaluation proved that recording previous accesses to data partitions yields a performance improvement. ICI is generally applicable to any data structure and provides gains of tens of per cent in query response time, which directly corresponds to the number of accessed data partitions.</p>
<p>The next contribution lies in approximate kNN query evaluation with ICI. We proposed a way of setting the ICI parameter values based on the intrinsic dimensionality of the data, the number of data partitions created by an indexing structure to organize a data-set, and the number of queries processed within a desired period of time. Though this proposal is preliminary, it already provides consistently better results.</p>
<p>The outcome of this article motivates our future work, which will focus on automatic ICI parameter selection and its thorough theoretical and experimental evaluation. We will also investigate the possibility of adaptively switching between the original priority queue ordering and the ICI-based ordering. Such a procedure might lead to an improvement for all queries, i.e. also for queries that are not similar to any previously executed ones.</p>
</sec>
</body>
<back>
<ack id="j_info1130_ack_001">
<title>Acknowledgements</title>
<p>This work was supported by Czech Science Foundation project GA16-18889S.</p></ack>
<ref-list id="j_info1130_reflist_001">
<title>References</title>
<ref id="j_info1130_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Amato</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Rabitti</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Savino</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>2003</year>). <article-title>Region proximity in metric spaces and its use for approximate similarity search</article-title>. <source>ACM Transactions on Information Systems (TOIS)</source>, <volume>21</volume>(<issue>2</issue>), <fpage>192</fpage>–<lpage>227</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Antol</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Dohnal</surname>, <given-names>V.</given-names></string-name> (<year>2016</year>). <chapter-title>Optimizing query performance with inverted cache in metric spaces</chapter-title>. In: <source>Proceedings of ADBIS 2016: 20th East-European Conference on Advances in Databases and Information Systems</source>, pp. <fpage>1</fpage>–<lpage>14</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Barrios</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Bustos</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Skopal</surname>, <given-names>T.</given-names></string-name> (<year>2014</year>). <article-title>Analyzing and dynamically indexing the query set</article-title>. <source>Information Systems</source>, <volume>45</volume>, <fpage>37</fpage>–<lpage>47</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_004">
<mixed-citation publication-type="journal"><string-name><surname>Batko</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Falchi</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Lucchese</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Novak</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Perego</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Rabitti</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Sedmidubsky</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>). <article-title>Building a web-scale image similarity search system</article-title>. <source>Multimedia Tools and Applications</source>, <volume>47</volume>(<issue>3</issue>), <fpage>599</fpage>–<lpage>629</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_005">
<mixed-citation publication-type="chapter"><string-name><surname>Beecks</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Skopal</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Schöffmann</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Seidl</surname>, <given-names>T.</given-names></string-name> (<year>2011</year>). <chapter-title>Towards large-scale multimedia exploration</chapter-title>. In: <source>Proceedings of 5th International Workshop on Ranking in Databases (DBRank 2011)</source>, <conf-loc>Seattle, WA, USA</conf-loc>, pp. <fpage>31</fpage>–<lpage>33</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Böhm</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Berchtold</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Keim</surname>, <given-names>D.A.</given-names></string-name> (<year>2001</year>). <article-title>Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases</article-title>. <source>ACM Computing Surveys</source>, <volume>33</volume>(<issue>3</issue>), <fpage>322</fpage>–<lpage>373</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_007">
<mixed-citation publication-type="chapter"><string-name><surname>Brisaboa</surname>, <given-names>N.R.</given-names></string-name>, <string-name><surname>Cerdeira-Pena</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gil-Costa</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Marin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Pedreira</surname>, <given-names>O.</given-names></string-name> (<year>2015</year>). <chapter-title>Efficient similarity search by combining indexing and caching strategies</chapter-title>. In: <source>Proceedings of SOFSEM 2015: Theory and Practice of Computer Science: 41st International Conference on Current Trends in Theory and Practice of Computer Science</source>. <publisher-name>Springer</publisher-name>, <publisher-loc>Berlin</publisher-loc>, pp. <fpage>486</fpage>–<lpage>497</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_008">
<mixed-citation publication-type="chapter"><string-name><surname>Budikova</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Batko</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>2011</year>). <chapter-title>Evaluation platform for content-based image retrieval systems</chapter-title>. In: <string-name><surname>Gradmann</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Borri</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Meghini</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Schuldt</surname>, <given-names>H.</given-names></string-name> (Eds.), <source>Proceedings of International Conference on Theory and Practice of Digital Libraries, Research and Advanced Technology for Digital Libraries (TPDL)</source>. <publisher-name>Springer</publisher-name>, <publisher-loc>Berlin</publisher-loc>, pp. <fpage>130</fpage>–<lpage>142</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Chávez</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Navarro</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Baeza-Yates</surname>, <given-names>R.A.</given-names></string-name>, <string-name><surname>Marroquín</surname>, <given-names>J.L.</given-names></string-name> (<year>2001</year>). <article-title>Searching in metric spaces</article-title>. <source>ACM Computing Surveys (CSUR)</source>, <volume>33</volume>(<issue>3</issue>), <fpage>273</fpage>–<lpage>321</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Ciaccia</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Patella</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>1997</year>). <chapter-title>M-tree: an efficient access method for similarity search in metric spaces</chapter-title>. In: <string-name><surname>Jarke</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Carey</surname>, <given-names>M.J.</given-names></string-name>, <string-name><surname>Dittrich</surname>, <given-names>K.R.</given-names></string-name>, <string-name><surname>Lochovsky</surname>, <given-names>F.H.</given-names></string-name>, <string-name><surname>Loucopoulos</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Jeusfeld</surname>, <given-names>M.A.</given-names></string-name> (Eds.), <source>Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997)</source>, <conf-loc>Athens, Greece, August 25–29, 1997</conf-loc>. <publisher-name>Morgan Kaufmann</publisher-name>, pp. <fpage>426</fpage>–<lpage>435</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_011">
<mixed-citation publication-type="book"><string-name><surname>Deepak</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Prasad</surname>, <given-names>M.D.</given-names></string-name> (<year>2015</year>). <source>Operators for Similarity Search: Semantics, Techniques and Usage Scenarios</source>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Esuli</surname>, <given-names>A.</given-names></string-name> (<year>2012</year>). <article-title>Use of permutation prefixes for efficient and scalable approximate similarity search</article-title>. <source>Information Processing &amp; Management</source>, <volume>48</volume>(<issue>5</issue>), <fpage>889</fpage>–<lpage>902</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Falchi</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Lucchese</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Orlando</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Perego</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Rabitti</surname>, <given-names>F.</given-names></string-name> (<year>2009</year>). <chapter-title>Caching content-based queries for robust and efficient image retrieval</chapter-title>. In: <source>Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’09</source>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York</publisher-loc>, pp. <fpage>780</fpage>–<lpage>790</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Houle</surname>, <given-names>M.E.</given-names></string-name>, <string-name><surname>Nett</surname>, <given-names>M.</given-names></string-name> (<year>2015</year>). <article-title>Rank-based similarity search: reducing the dimensional dependence</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, <volume>37</volume>(<issue>1</issue>), <fpage>136</fpage>–<lpage>150</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_015">
<mixed-citation publication-type="chapter"><string-name><surname>Houle</surname>, <given-names>M.E.</given-names></string-name>, <string-name><surname>Sakuma</surname>, <given-names>J.</given-names></string-name> (<year>2005</year>). <chapter-title>Fast approximate similarity search in extremely high-dimensional data sets</chapter-title>. In: <source>Proceedings of the 21st International Conference on Data Engineering (ICDE)</source>, pp. <fpage>619</fpage>–<lpage>630</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Jia</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Shelhamer</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Donahue</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Karayev</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Long</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Girshick</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Guadarrama</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Darrell</surname>, <given-names>T.</given-names></string-name> (<year>2014</year>). <chapter-title>Caffe: convolutional architecture for fast feature embedding</chapter-title>. In: <source>Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14</source>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York</publisher-loc>, pp. <fpage>675</fpage>–<lpage>678</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Lew</surname>, <given-names>M.S.</given-names></string-name>, <string-name><surname>Sebe</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Djeraba</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Jain</surname>, <given-names>R.</given-names></string-name> (<year>2006</year>). <article-title>Content-based multimedia information retrieval: state of the art and challenges</article-title>. <source>ACM Transactions on Multimedia Computing, Communications, and Applications</source>, <volume>2</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>19</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>Novak</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Batko</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>2011</year>). <article-title>Metric index: an efficient and scalable solution for precise and approximate similarity search</article-title>. <source>Information Systems</source>, <volume>36</volume>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Novak</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name> (<year>2014</year>). <chapter-title>Rank aggregation of candidate sets for efficient similarity search</chapter-title>. In: <string-name><surname>Decker</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Lhotská</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Link</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Spies</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wagner</surname>, <given-names>R.R.</given-names></string-name> (Eds.), <source>Proceedings of 25th International Conference on Database and Expert Systems Applications (DEXA)</source>. <publisher-name>Springer International Publishing</publisher-name>, <publisher-loc>Cham</publisher-loc>, pp. <fpage>42</fpage>–<lpage>58</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_020">
<mixed-citation publication-type="chapter"><string-name><surname>Oliveira</surname>, <given-names>P.H.</given-names></string-name>, <string-name><surname>Traina</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Kaster</surname>, <given-names>D.S.</given-names></string-name> (<year>2015</year>). <chapter-title>Improving the pruning ability of dynamic metric access methods with local additional pivots and anticipation of information</chapter-title>. In: <source>Proceedings of 19th East European Conference on Advances in Databases and Information Systems (ADBIS)</source>, <conf-loc>Poitiers, France, September 8–11, 2015</conf-loc>. <publisher-name>Springer International Publishing</publisher-name>, <publisher-loc>Cham</publisher-loc>, pp. <fpage>18</fpage>–<lpage>31</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_021">
<mixed-citation publication-type="chapter"><string-name><surname>Pramanik</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ruan</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bhattacharjee</surname>, <given-names>S.K.</given-names></string-name> (<year>1999</year>). <chapter-title>Efficient search scheme for very large image databases</chapter-title>. In: <string-name><surname>Beretta</surname>, <given-names>G.B.</given-names></string-name>, <string-name><surname>Schettini</surname>, <given-names>R.</given-names></string-name> (Eds.), <source>Proceedings of the International Society for Optical Engineering (SPIE) on Internet Imaging</source>, <conf-loc>San Jose, California, USA, January 26, 2000</conf-loc>, Vol. <volume>3964</volume>. <publisher-name>The International Society for Optical Engineering</publisher-name>, pp. <fpage>79</fpage>–<lpage>90</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_022">
<mixed-citation publication-type="chapter"><string-name><surname>Sadit Tellez</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Chávez</surname>, <given-names>E.</given-names></string-name> (<year>2012</year>). <chapter-title>The list of clusters revisited</chapter-title>. In: <string-name><surname>Carrasco-Ochoa</surname>, <given-names>J.A.</given-names></string-name>, <string-name><surname>Martínez-Trinidad</surname>, <given-names>J.F.</given-names></string-name>, <string-name><surname>Olvera López</surname>, <given-names>J.A.</given-names></string-name>, <string-name><surname>Boyer</surname>, <given-names>K.L.</given-names></string-name> (Eds.), <source>Proceedings of 4th Mexican Conference on Pattern Recognition (MCPR)</source>. <publisher-name>Springer</publisher-name>, <publisher-loc>Berlin</publisher-loc>, pp. <fpage>187</fpage>–<lpage>196</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_023">
<mixed-citation publication-type="book"><string-name><surname>Samet</surname>, <given-names>H.</given-names></string-name> (<year>2006</year>). <source>Foundations of Multidimensional and Metric Data Structures. The Morgan Kaufmann Series in Data Management Systems</source>. <publisher-name>Morgan Kaufmann</publisher-name>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_024">
<mixed-citation publication-type="chapter"><string-name><surname>Skopal</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Hoksza</surname>, <given-names>D.</given-names></string-name> (<year>2007</year>). <chapter-title>Improving the performance of M-tree family by nearest-neighbor graphs</chapter-title>. In: <string-name><surname>Ioannidis</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Novikov</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Rachev</surname>, <given-names>B.</given-names></string-name> (Eds.), <source>Proceedings of the 11th East European Conference on Advances in Databases and Information Systems (ADBIS)</source>, <conf-loc>Varna, Bulgaria, September 29–October 3, 2007</conf-loc>. <publisher-name>Springer</publisher-name>, <publisher-loc>Berlin</publisher-loc>, pp. <fpage>172</fpage>–<lpage>188</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Skopal</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Lokoc</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bustos</surname>, <given-names>B.</given-names></string-name> (<year>2012</year>). <article-title>D-cache: universal distance cache for metric access methods</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>24</volume>(<issue>5</issue>), <fpage>868</fpage>–<lpage>881</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_026">
<mixed-citation publication-type="chapter"><string-name><surname>Skopal</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Pokorný</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Snášel</surname>, <given-names>V.</given-names></string-name> (<year>2005</year>). <chapter-title>Nearest neighbours search using the PM-Tree</chapter-title>. In <series>Lecture Notes in Computer Science</series><italic>: Vol.</italic> <volume>3453</volume>. <source>Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA)</source>, <conf-loc>Beijing, China</conf-loc>, <conf-date>April 17–20, 2005</conf-date>. <publisher-name>Springer</publisher-name>, pp. <fpage>803</fpage>–<lpage>815</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Vilar</surname>, <given-names>J.M.</given-names></string-name> (<year>1995</year>). <article-title>Reducing the overhead of the AESA metric-space nearest neighbour searching algorithm</article-title>. <source>Information Processing Letters</source>, <volume>56</volume>(<issue>5</issue>), <fpage>265</fpage>–<lpage>271</lpage>.</mixed-citation>
</ref>
<ref id="j_info1130_ref_028">
<mixed-citation publication-type="book"><string-name><surname>Zezula</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Amato</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Dohnal</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Batko</surname>, <given-names>M.</given-names></string-name> (<year>2005</year>). <source>Similarity Search: The Metric Space Approach. Advances in Database Systems</source>, Vol. <volume>32</volume>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>