<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn>
<issn pub-type="ppub">0868-4952</issn>
<issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR413</article-id>
<article-id pub-id-type="doi">10.15388/20-INFOR413</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>An Effective Approach to Outlier Detection Based on Centrality and Centre-Proximity</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Bae</surname><given-names>Duck-Ho</given-names></name><xref ref-type="aff" rid="j_infor413_aff_001">1</xref><bio>
<p><bold>D.-H. Bae</bold> received his BS, MS, and PhD degrees in electronics and computer engineering from the Hanyang University, Seoul, Korea, in 2006, 2008, and 2013, respectively. Currently, he is a principal engineer at Samsung Electronics. His research interests include data mining, databases, and storage systems.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Jeong</surname><given-names>Seo</given-names></name><xref ref-type="aff" rid="j_infor413_aff_001">1</xref><bio>
<p><bold>S. Jeong</bold> received his BS in computer science and engineering from the Chung-ang University, Seoul, Korea, in 2009, and his MS in computer engineering from the Hanyang University, Seoul, Korea, in 2011. His graduate research focused on clustering and outlier detection methodologies. After graduation, he worked as a software development engineer for LG Electronics and EA Games, and is currently working for Amazon.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Hong</surname><given-names>Jiwon</given-names></name><xref ref-type="aff" rid="j_infor413_aff_001">1</xref><bio>
<p><bold>J. Hong</bold> received his BS in computer science from the Hanyang University, Seoul, Korea, in 2009. He is currently pursuing a PhD degree in computer and software at the Hanyang University. His research interests include data mining, databases, social network analysis, and recommender systems.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Lee</surname><given-names>Minsoo</given-names></name><xref ref-type="aff" rid="j_infor413_aff_002">2</xref><bio>
<p><bold>M. Lee</bold> received his PhD degree from the University of Florida, and his MS and BS from the Department of Computer Science and Engineering, Seoul National University, in 2000, 1995, and 1992, respectively. He has been a professor at the Department of Computer Science and Engineering, Ewha Womans University, Seoul, Korea, since 2002. He worked for LG Electronics from 1995 to 1996, and for Oracle Corporation in the US as a Senior Member of Technical Staff from 2000 to 2002. His research interests include data mining, data warehouses, web information infrastructures, stream data processing, and deep learning.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Ivanović</surname><given-names>Mirjana</given-names></name><xref ref-type="aff" rid="j_infor413_aff_003">3</xref><bio>
<p><bold>M. Ivanović</bold>, PhD since 2002, holds the position of full professor at the Faculty of Sciences, University of Novi Sad, Serbia. She has been a member of the University Council for Informatics for more than 10 years. She is the author or co-author of 13 textbooks, 13 edited proceedings, 3 monographs, and more than 440 research papers on multi-agent systems, e-learning and web-based learning, applications of intelligent techniques (CBR, data and web mining), and software engineering education, most of which are published in international journals and proceedings of high-quality international conferences. She is or was a member of the Program Committees of more than 200 international conferences, and General Chair and Program Committee Chair of numerous international conferences. She has also been an invited speaker at several international conferences and a visiting lecturer in Australia, Thailand and China. As a leader and researcher she has participated in numerous international projects. Currently she is the editor-in-chief of the <italic>Computer Science and Information Systems Journal</italic>.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Savić</surname><given-names>Miloš</given-names></name><xref ref-type="aff" rid="j_infor413_aff_003">3</xref><bio>
<p><bold>M. Savić</bold> is an assistant professor at the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, where he received his BSc, MSc and PhD degrees in the field of computer science in 2010, 2011 and 2015, respectively. His research interests are in the field of complex network analysis, graph-based machine learning techniques and scientometrics.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Kim</surname><given-names>Sang-Wook</given-names></name><email xlink:href="wook@hanyang.ac.kr">wook@hanyang.ac.kr</email><xref ref-type="aff" rid="j_infor413_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>S.-W. Kim</bold> received the BS degree in computer engineering from Seoul National University, in 1989, and the MS and PhD degrees in computer science from the Korea Advanced Institute of Science and Technology (KAIST), in 1991 and 1994, respectively. From 1995 to 2003, he served as an associate professor at the Kangwon National University. In 2003, he joined the Hanyang University, Seoul, Korea, where he is currently a professor at the Department of Computer and Software and the director of the Brain-Korea-21-Plus research program. He has also been leading a National Research Lab (NRL) Project funded by the National Research Foundation since 2015. His research interests include databases, data mining, multimedia information retrieval, social network analysis, recommendation, and web data analysis.</p></bio>
</contrib>
<aff id="j_infor413_aff_001"><label>1</label>Department of Computer Science, <institution>Hanyang University</institution>, <country>Republic of Korea</country></aff>
<aff id="j_infor413_aff_002"><label>2</label>Department of Computer Science and Engineering, <institution>Ewha Womans University</institution>, <country>Republic of Korea</country></aff>
<aff id="j_infor413_aff_003"><label>3</label>Department of Mathematics and Informatics, <institution>University of Novi Sad</institution>, <country>Serbia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2020</year></pub-date>
<pub-date pub-type="epub"><day>6</day><month>5</month><year>2020</year></pub-date><volume>31</volume><issue>3</issue><fpage>435</fpage><lpage>458</lpage>
<history>
<date date-type="received"><month>1</month><year>2019</year></date>
<date date-type="accepted"><month>3</month><year>2020</year></date>
</history>
<permissions><copyright-statement>© 2020 Vilnius University</copyright-statement><copyright-year>2020</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>In data mining research, outliers usually represent extreme values that deviate from other observations in the data. A significant issue with existing outlier detection methods is that they consider only the object itself, without taking its neighbouring objects into account, when extracting location features. In this paper, we propose an innovative approach to this issue. First, we propose the notions of <italic>centrality</italic> and <italic>centre-proximity</italic> for determining the degree of outlierness while considering the distribution of all objects. We also propose a novel graph-based outlier detection algorithm based on these notions. The algorithm solves the problems of existing methods, i.e. the problems of <italic>local density</italic>, <italic>micro-clusters</italic>, and <italic>fringe objects</italic>. We performed extensive experiments to confirm the effectiveness and efficiency of the proposed method. The obtained experimental results show that the proposed method uncovers outliers successfully and outperforms previous outlier detection methods.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>graph-based outlier detection</kwd>
<kwd>centrality</kwd>
<kwd>centre-proximity</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source xlink:href="https://doi.org/10.13039/501100014188">MSIT</funding-source>
<award-id>NRF-2020R1A2B5B03001960</award-id>
<award-id>NRF-2017M3C4A7069440</award-id>
<award-id>NRF-2017M3C4A7083678</award-id>
</award-group>
<funding-statement>This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Ministry of Science and ICT (MSIT) (No. NRF-2020R1A2B5B03001960) and also by the Next-Generation Information Computing Development Program through the NRF funded by the MSIT (No. NRF-2017M3C4A7069440 and No. NRF-2017M3C4A7083678). </funding-statement>
</funding-group>
</article-meta>
</front>
<body>
<sec id="j_infor413_s_001">
<label>1</label>
<title>Introduction</title>
<p>In contemporary research on different data-mining applications, outlier detection is a primary activity. Generally speaking, outliers represent extreme values that deviate from other observations in the data. Outliers are often considered noise or experimental error; however, they may carry important information. Detected outliers are usually seen as potential candidates for abnormal data that may lead to model misspecification and incorrect results. In some situations, however, they can be considered a novelty within the dataset. In other words, an outlier is an observation that diverges from the overall pattern of a sample. Two kinds of outliers can be distinguished: <italic>univariate</italic> and <italic>multivariate</italic>. Univariate outliers relate to the distribution of values in a single feature space, while multivariate outliers relate to an <italic>n</italic>-dimensional space (of <italic>n</italic> features). Outlier detection can also be classified as <italic>parametric</italic> or <italic>non-parametric</italic>, depending on whether a distribution of values for the selected features is assumed.</p>
<p>In the literature, there is no exact definition of an outlier, as it highly depends on hidden assumptions about the data structure and the selected detection method. In our research, we concentrate on outliers as objects that are dissimilar to the majority of other objects (Barnett and Lewis, <xref ref-type="bibr" rid="j_infor413_ref_003">1994</xref>; Han and Kamber, <xref ref-type="bibr" rid="j_infor413_ref_014">2000</xref>; Hawkins, <xref ref-type="bibr" rid="j_infor413_ref_015">1980</xref>; Akoglu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_001">2010</xref>; Huang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_016">2016</xref>; Yerlikaya-Ö <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_029">2016</xref>; Domingues <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_010">2018</xref>; Kieu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_018">2018</xref>). Detection of such outliers is of high importance for different applications and depends strongly on the specific dataset. Typical application domains include detecting misuse of medicines, detecting network intrusions (Song <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_026">2013</xref>; Fanaee-T and Gama, <xref ref-type="bibr" rid="j_infor413_ref_011">2016</xref>), and detecting economic and financial frauds (Chan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_008">2010</xref>; Friedman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_012">2007</xref>).</p>
<p>To illustrate our approach and understanding of outliers, we use a visual example. Fig. <xref rid="j_infor413_fig_001">1</xref> shows a 2-dimensional dataset. There are many objects with characteristics similar to those of object <inline-formula id="j_infor413_ineq_001"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> (for example, object <inline-formula id="j_infor413_ineq_002"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{4}}$]]></tex-math></alternatives></inline-formula>), and these objects are considered to be “<italic>normal objects</italic>”. A set of similar normal objects could form a <italic>cluster</italic> (Widyantoro <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_028">2002</xref>). An object <inline-formula id="j_infor413_ineq_003"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{2}}$]]></tex-math></alternatives></inline-formula> is dissimilar to normal objects. In our paper, we will consider and detect such objects as <italic>outliers</italic>. Contrary to object <inline-formula id="j_infor413_ineq_004"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{2}}$]]></tex-math></alternatives></inline-formula>, object <inline-formula id="j_infor413_ineq_005"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{3}}$]]></tex-math></alternatives></inline-formula> and a small number of its neighbouring objects possess some similar characteristics, forming a small cluster. The objects in this small cluster are rather dissimilar from the majority of the objects. Accordingly, we define this small cluster as a <italic>micro-cluster</italic>, since object <inline-formula id="j_infor413_ineq_006"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{3}}$]]></tex-math></alternatives></inline-formula> and its neighbouring objects form some kind of outliers (Papadimitriou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_024">2003</xref>). Objects such as <inline-formula id="j_infor413_ineq_007"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{4}}$]]></tex-math></alternatives></inline-formula> are normal but distant from the core of the cluster they belong to. We call such objects <italic>fringe objects</italic>. Fringe objects should not be confused with real outliers (Papadimitriou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_024">2003</xref>). Researchers have to be aware that normal objects and outliers cannot be differentiated using <italic>absolute criteria</italic>. Instead, a variety of <italic>relative criteria</italic> should be considered, since datasets differ in nature and characteristics, such as the size of the dataset, the distribution of objects in the dataset, and so on.</p>
<fig id="j_infor413_fig_001">
<label>Fig. 1</label>
<caption>
<p>A 2-dimensional dataset.</p>
</caption>
<graphic xlink:href="infor413_g001.jpg"/>
</fig>
<p>As nowadays outlier detection is a challenging research area, a wide range of methods is used (Bay and Schwabacher, <xref ref-type="bibr" rid="j_infor413_ref_004">2003</xref>; Böhm <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_005">2009</xref>; Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>; Knorr and Ng, <xref ref-type="bibr" rid="j_infor413_ref_020">1999</xref>; Knorr <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_021">2000</xref>; Moonesinghe and Tan, <xref ref-type="bibr" rid="j_infor413_ref_022">2008</xref>; Papadimitriou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_024">2003</xref>; Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>). The majority of these methods use their own <italic>object location features</italic> which reflect the relative characteristics of each object over the distribution of all objects in the dataset. The standard flow of activities is presented below:</p>
<list>
<list-item id="j_infor413_li_001">
<label>•</label>
<p><bold>Step 1.</bold> The first activity is to compute the location features of each object in the dataset.</p>
</list-item>
<list-item id="j_infor413_li_002">
<label>•</label>
<p><bold>Step 2.</bold> After that, an <italic>outlierness score</italic> that reflects those location features is assigned to the object. The outlierness score of an object <italic>i</italic> indicates how different object <italic>i</italic> is compared to other objects in the dataset.</p>
</list-item>
<list-item id="j_infor413_li_003">
<label>•</label>
<p><bold>Step 3.</bold> Finally, the methods consider as outliers the top <italic>m</italic> objects with the highest outlierness scores.</p>
</list-item>
</list>
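<p>The three steps above can be sketched in a few lines of Python. This is a generic illustration, not a specific published method; the mean-distance location feature and all function names are assumptions made only for this example.</p>

```python
import math

def detect_outliers(points, m):
    """Generic three-step flow (illustrative only): the location feature
    here is each object's mean distance to all other objects, used
    directly as its outlierness score."""
    n = len(points)
    scores = {}
    for i, p in enumerate(points):
        # Step 1: compute a location feature for each object.
        total = sum(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Step 2: assign an outlierness score reflecting that feature.
        scores[i] = total / (n - 1)
    # Step 3: report the top-m objects with the highest scores as outliers.
    return sorted(scores, key=scores.get, reverse=True)[:m]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(detect_outliers(data, m=1))  # [4]: the isolated object ranks first
```

<p>Note that such a global feature already depends on a user-given parameter <italic>m</italic>, and more refined features depend on further parameters, which is the dependency criticized below.</p>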
<p>The significant issue with existing outlier detection methods is that they extract location features taking into account only <italic>the characteristics of the object itself</italic>. They pay attention neither to <italic>the characteristics of its neighbouring objects</italic> nor to the distribution of <italic>all of the objects</italic> in the dataset. Therefore, the location features of each object are highly dependent on the parameters given by users. This is the main reason for the well-known local density and micro-cluster problems (Papadimitriou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_024">2003</xref>). Because these methods rely so heavily on such location features, the local density problem arises: whether an object is an outlier is determined only by its local density. The micro-cluster problem is equally essential: in such circumstances, micro-clusters cannot be detected as outliers.</p>
<p>An additional quality of some methods is that, when calculating the outlierness score of each object, they consider not only the location features of the object itself but also those of its neighbouring objects (Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>). However, because of the fundamental limitation that their location features do not consider the distribution of <italic>all</italic> objects in the dataset, they still suffer from the micro-cluster problem.</p>
<p>Having in mind the above mentioned limitations of existing methods, in this paper we are going to propose a novel approach.<xref ref-type="fn" rid="j_infor413_fn_001">1</xref><fn id="j_infor413_fn_001"><label><sup>1</sup></label>
<p>A preliminary version of this work has been presented as a poster (4 pages) at ACM CIKM 2012 (Bae <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_002">2012</xref>). This paper has the following additional contributions compared with its preliminary version: (1) we define four goals of the proposed method to clarify its advantages; (2) we propose three graph models for our method and compare their effectiveness via extensive experiments; (3) we propose two weighting schemes for our method and compare their effectiveness via extensive experiments; (4) we show the results obtained by applying the proposed method to a real-world NBA dataset; (5) we show the superiority of the proposed method over more competitors.</p></fn> First, we propose the notions of <italic>centrality</italic> and <italic>centre-proximity</italic> as novel relative location features that can take into consideration the characteristics of <italic>all</italic> the objects in the dataset, rather than only the characteristics of the object itself. Furthermore, we propose a graph-based method for outlier detection that calculates the outlierness score of each object using these two newly introduced relative location features. Our method models a given dataset as a <italic>k</italic>-NN graph and calculates the centrality and centre-proximity scores of each object from the <italic>k</italic>-NN graph iteratively. To show the effectiveness of our method, we carried out a variety of experiments. The obtained results are promising; moreover, compared with previous outlier detection methods, our method achieves the best precision.</p>
<p>The rest of the paper is organized as follows. Section <xref rid="j_infor413_s_002">2</xref> is devoted to some existing outlier detection methods and points out their deficiencies. Section <xref rid="j_infor413_s_007">3</xref> brings an overview of our proposed method and outlines its main four goals. Technical details of our method are presented in Section <xref rid="j_infor413_s_008">4</xref>. In Section <xref rid="j_infor413_s_009">5</xref>, we explain graph modelling schemes for our graph-based outlier detection. Experimental results of performance analysis to verify the superiority of our method are presented in Section <xref rid="j_infor413_s_017">6</xref>. Finally, Section <xref rid="j_infor413_s_028">7</xref> brings concluding remarks and possibilities for future work.</p>
</sec>
<sec id="j_infor413_s_002">
<label>2</label>
<title>Related Work</title>
<p>In this section, we first introduce previous outlier detection methods, which are statistics-based, distance-based, density-based, and RWR-based methods. We then point out their problems in outlier detection.</p>
<sec id="j_infor413_s_003">
<label>2.1</label>
<title>Statistics-Based Outlier Detection</title>
<p>The statistics-based outlier detection method finds the most suitable statistical distribution model for the distribution of objects in the given dataset and detects the objects that deviate from the statistical distribution model as outliers (Chandola <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_009">2009</xref>).</p>
<p>If the dataset is generated under a specific statistical distribution model, the statistics-based outlier detection method will work well as a feasible solution to the detection of outliers. However, the problem is that most real-world data is not generated from a specific statistical distribution model (Chandola <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_009">2009</xref>; Knorr and Ng, <xref ref-type="bibr" rid="j_infor413_ref_020">1999</xref>; Knorr <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_021">2000</xref>). Additionally, when detecting outliers from a multi-dimensional dataset, it is very difficult to find a statistical distribution model that can fit the distribution of almost all objects in the dataset in terms of all dimensions (Chandola <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_009">2009</xref>; Knorr and Ng, <xref ref-type="bibr" rid="j_infor413_ref_020">1999</xref>; Knorr <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_021">2000</xref>).</p>
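<p>Under a simple Gaussian assumption, for instance, this approach reduces to a z-score test. The sketch below is illustrative only; the threshold value is an arbitrary choice, not one prescribed by the cited literature.</p>

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Statistics-based detection under an assumed Gaussian model: flag
    values whose z-score magnitude exceeds a user-chosen threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
print(zscore_outliers(data, threshold=2.0))  # [25.0]
```

<p>Note that the outlier itself inflates the estimated mean and standard deviation, which is one more reason such fixed-model approaches struggle on real-world, multi-dimensional data.</p>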
</sec>
<sec id="j_infor413_s_004">
<label>2.2</label>
<title>Distance-Based Outlier Detection</title>
<p>The distance-based outlier detection method uses the <italic>distance among objects</italic> as a location feature, and detects as outliers the ones whose distance to other objects exceeds a specific threshold (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>). Here, the threshold is a user-defined parameter.</p>
<p>There have been several previous studies on distance-based outlier detection. The <italic>DB-outlier</italic> method (Knorr and Ng, <xref ref-type="bibr" rid="j_infor413_ref_020">1999</xref>; Knorr <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_021">2000</xref>) uses the number of other objects near the target object <italic>o</italic> to decide whether <italic>o</italic> is an outlier or not. If there are fewer than <italic>p</italic> objects within distance <italic>d</italic> from <italic>o</italic>, then <italic>o</italic> is considered an outlier. The <italic>k-Dist</italic> method (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>) uses the <italic>k</italic>-Dist as a location feature, and detects as outliers the top <italic>m</italic> objects having the largest <italic>k</italic>-Dist values. Here, the <italic>k</italic>-Dist is the distance from each object to its <italic>k</italic>-th nearest object.</p>
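<p>Both rules can be sketched in a few lines; the parameter values below are illustrative assumptions, not values recommended by the cited studies.</p>

```python
import math

def db_outliers(points, d, p):
    """DB-outlier rule: o is an outlier if fewer than p other objects
    lie within distance d of o (d and p are user-given parameters)."""
    return [i for i, a in enumerate(points)
            if sum(1 for j, b in enumerate(points)
                   if j != i and math.dist(a, b) <= d) < p]

def k_dist_outliers(points, k, m):
    """k-Dist rule: score each object by the distance to its k-th
    nearest object, and report the top-m scores as outliers."""
    def k_dist(i):
        ds = sorted(math.dist(points[i], q)
                    for j, q in enumerate(points) if j != i)
        return ds[k - 1]
    return sorted(range(len(points)), key=k_dist, reverse=True)[:m]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(data, d=2.0, p=2))    # [4]
print(k_dist_outliers(data, k=2, m=1))  # [4]
```

<p>Both rules hinge on user-given parameters (<italic>d</italic> and <italic>p</italic>, or <italic>k</italic> and <italic>m</italic>), which is exactly where the local density problem discussed next originates.</p>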
<p>Distance-based outlier detection methods use features that consider only <italic>the characteristics of the object itself</italic>. Thus, they could cause the <italic>local density problem</italic> (Papadimitriou <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_024">2003</xref>). Figure <xref rid="j_infor413_fig_002">2</xref> shows an example of a 2-dimensional dataset with the local density problem. The dataset has two clusters (<inline-formula id="j_infor413_ineq_008"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor413_ineq_009"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula>) and an outlier <inline-formula id="j_infor413_ineq_010"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula>. To classify object <inline-formula id="j_infor413_ineq_011"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> as an outlier, DB-outlier should set <inline-formula id="j_infor413_ineq_012"><alternatives>
<mml:math><mml:mi mathvariant="italic">d</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[$d={d_{1}}$]]></tex-math></alternatives></inline-formula>. However, normal objects inside <inline-formula id="j_infor413_ineq_013"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> are also classified as outliers with <inline-formula id="j_infor413_ineq_014"><alternatives>
<mml:math><mml:mi mathvariant="italic">d</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[$d={d_{1}}$]]></tex-math></alternatives></inline-formula>. If DB-outlier sets <inline-formula id="j_infor413_ineq_015"><alternatives>
<mml:math><mml:mi mathvariant="italic">d</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[$d={d_{2}}$]]></tex-math></alternatives></inline-formula> to classify objects in <inline-formula id="j_infor413_ineq_016"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> as normal objects, then <inline-formula id="j_infor413_ineq_017"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> is also classified as a normal object.</p>
<fig id="j_infor413_fig_002">
<label>Fig. 2</label>
<caption>
<p>An example of the local density problem.</p>
</caption>
<graphic xlink:href="infor413_g002.jpg"/>
</fig>
</sec>
<sec id="j_infor413_s_005">
<label>2.3</label>
<title>Density-Based Outlier Detection</title>
<p>The density-based outlier detection methods classify an object as an outlier by considering the <italic>relative density</italic> of the target object (Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>). The <italic>density</italic> of an object can be determined by the number of objects around it.</p>
<p>The <italic>LOF</italic> method (Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>; Na <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_023">2018</xref>) is a typical example of density-based outlier detection. The LOF method defines the density of an object as the inverse of the average reachability distance from the object to its <italic>k</italic>-nearest objects. Here, the reachability distance from object <italic>p</italic> to object <italic>q</italic> is the larger of the distance between the two objects and the distance between <italic>q</italic> and the <italic>k</italic>-th nearest object of <italic>q</italic>. In the LOF method, an object <italic>p</italic> has a high outlierness score if the density of <italic>p</italic> is lower than that of <italic>p</italic>’s neighbouring objects.</p>
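<p>A minimal sketch of the reachability distance and the resulting density, assuming Euclidean distance and taking density as the inverse of the average reachability distance; the full LOF score, which compares an object's density with those of its neighbours, is omitted for brevity.</p>

```python
import math

def k_distance(points, q, k):
    """Distance from object q to its k-th nearest object."""
    ds = sorted(math.dist(points[q], points[j])
                for j in range(len(points)) if j != q)
    return ds[k - 1]

def reach_dist(points, p, q, k):
    """Reachability distance from p to q: the larger of d(p, q) and the
    k-distance of q, which smooths distances near cluster cores."""
    return max(math.dist(points[p], points[q]), k_distance(points, q, k))

def density(points, p, k):
    # Density as the inverse of the average reachability distance from p
    # to its k nearest neighbours (a simplified, LOF-style sketch).
    nbrs = sorted((j for j in range(len(points)) if j != p),
                  key=lambda j: math.dist(points[p], points[j]))[:k]
    return k / sum(reach_dist(points, p, q, k) for q in nbrs)

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(density(data, 0, k=2) > density(data, 4, k=2))  # True
```

<p>An object inside a tight cluster has a far higher density than an isolated object, which is what the LOF score exploits.</p>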
<fig id="j_infor413_fig_003">
<label>Fig. 3</label>
<caption>
<p>An example of the micro-cluster problem.</p>
</caption>
<graphic xlink:href="infor413_g003.jpg"/>
</fig>
<p>The density-based methods consider the characteristics of the target object itself by comparing it with <italic>its neighbouring objects</italic>. By considering the neighbouring objects altogether, the local density problem is prevented. However, density-based methods still suffer from the micro-cluster problem (Moonesinghe and Tan, <xref ref-type="bibr" rid="j_infor413_ref_022">2008</xref>).</p>
<p>Figure <xref rid="j_infor413_fig_003">3</xref> shows an example of a 2-dimensional dataset with the micro-cluster problem. The dataset includes two clusters (<inline-formula id="j_infor413_ineq_018"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor413_ineq_019"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula>) and one micro cluster (<inline-formula id="j_infor413_ineq_020"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{3}}$]]></tex-math></alternatives></inline-formula>). The density of an object <inline-formula id="j_infor413_ineq_021"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> inside <inline-formula id="j_infor413_ineq_022"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{3}}$]]></tex-math></alternatives></inline-formula> is similar to that of its neighbouring objects which are also inside the micro-cluster <inline-formula id="j_infor413_ineq_023"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{3}}$]]></tex-math></alternatives></inline-formula>. When the outliers form a micro-cluster in this way, the density-based outlier detection methods may classify them as normal objects.</p>
<p>This problem arises because these methods take only the neighbouring objects into account when determining whether an object is an outlier. Thus, we should also consider the characteristics of <italic>all the objects</italic> in the dataset to overcome the micro-cluster problem and the local density problem.</p>
</sec>
<sec id="j_infor413_s_006">
<label>2.4</label>
<title>RWR-Based Outlier Detection</title>
<p>The RWR-based outlier detection methods (Moonesinghe and Tan, <xref ref-type="bibr" rid="j_infor413_ref_022">2008</xref>) model a given dataset as a graph, and perform the Random Walk with Restart (RWR) (Brin and Page, <xref ref-type="bibr" rid="j_infor413_ref_007">1998</xref>) on the modelled graph to detect outliers. Here, the inverse of the RWR score of each object is taken as its outlierness score.</p>
<p>There are two RWR-based outlier detection methods. The <italic>Outrank-a</italic> method models a given dataset as a complete weighted graph whose edge weights are the similarities between every pair of objects. The <italic>Outrank-b</italic> method starts from the same complete weighted graph and deletes the edges whose similarity is lower than a specific threshold <italic>t</italic>. Because these methods use a graph, they have the advantage that the characteristics of all the objects can be considered when deciding whether or not an object is an outlier.</p>
<p>However, in the RWR-based methods, the RWR score is transferred through a directed edge in a <italic>single direction</italic>, so they cannot differentiate the fringe objects from the outlier objects. Furthermore, the Outrank-a method, which models a dataset as a complete graph, <italic>directly</italic> considers the characteristics of all the other objects. As a result, the precision of outlier detection is low. In the case of Outrank-b, the precision of outlier detection is greatly affected by the user-defined parameter value <italic>t</italic>. These problems will be discussed in Sections <xref rid="j_infor413_s_008">4</xref> and <xref rid="j_infor413_s_009">5</xref> in more detail.</p>
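<p>The idea behind Outrank-b can be sketched as follows (an illustrative Python sketch under our own assumptions; the restart probability, iteration count, and similarity function used below are hypothetical choices, not values from the original methods):</p>

```python
import numpy as np

def rwr_scores(W, restart=0.15, iters=200):
    """Random Walk with Restart on a weighted adjacency matrix W:
    follow row-normalized transition probabilities, and with
    probability `restart` jump back to a uniformly random node."""
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    P = W / np.maximum(row_sums, 1e-12)   # row-stochastic; zero rows stay zero
    r = np.full(n, 1.0 / n)               # start from the uniform distribution
    for _ in range(iters):
        r = (1.0 - restart) * (r @ P) + restart / n
    return r

def outrank_b(S, t):
    """Outrank-b style scoring: drop edges whose similarity is below the
    threshold t from the complete similarity graph, then take the inverse
    of each object's RWR score as its outlierness score."""
    W = np.where(S >= t, S, 0.0)
    np.fill_diagonal(W, 0.0)              # no self-loops
    return 1.0 / rwr_scores(W)
```

An isolated object keeps only the restart mass, so its RWR score is low and its outlierness score high.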
</sec>
</sec>
<sec id="j_infor413_s_007">
<label>3</label>
<title>Overview</title>
<p>In this section, we define our goals of outlier detection and explain our strategies to achieve the goals. The following is our list of goals in detecting outliers.</p>
<list>
<list-item id="j_infor413_li_004">
<label>•</label>
<p><italic>Goal 1: Outliers should be accurately detected</italic>. The proposed method should avoid the local density problem and the micro-cluster problem. Furthermore, the fringe objects in clusters should not be classified as outliers.</p>
</list-item>
<list-item id="j_infor413_li_005">
<label>•</label>
<p><italic>Goal 2: For every object in a dataset, the proposed method should not only determine whether it is an outlier, but also provide an outlierness score of each object to the user.</italic> The user can use this score to understand how much an object deviates from the normal objects and can decide the number of outliers intuitively. Moreover, the user can also observe the changes in the outlierness score of each object caused by changes in the user-defined parameter values. This provides the user with hints on setting the parameter values.</p>
</list-item>
<list-item id="j_infor413_li_006">
<label>•</label>
<p><italic>Goal 3: The proposed method should be able to handle data of any type and/or any form.</italic> The proposed method should be applicable to multi-dimensional data, non-coordinate data, and more, in contrast to the statistics-based outlier detection method that has limitations on the applicable data form.</p>
</list-item>
<list-item id="j_infor413_li_007">
<label>•</label>
<p><italic>Goal 4: The number of user-defined parameters should be as small as possible, and the fluctuation of the precision caused by changes in the user-defined parameter values should be small.</italic> In general circumstances, the user does not have a perfect understanding of a given dataset. Thus, our proposed method should not be sensitive to its parameters and should provide reasonable accuracy with non-optimal parameters.</p>
</list-item>
</list>
<p>We propose a novel outlier detection method to achieve the goals above. First, we propose two novel location features called <italic>centrality</italic> and <italic>centre-proximity</italic>. Both features enable our proposed method to avoid the local-density problem and the micro-cluster problem by taking the characteristics of <italic>each object</italic>, <italic>its neighbouring objects</italic>, and <italic>all the objects</italic> in the dataset into account (Goal 1). Centrality and centre-proximity also make our proposed method robust to its parameters (Goal 4).</p>
<fig id="j_infor413_fig_004">
<label>Fig. 4</label>
<caption>
<p>Overview of the proposed method (<inline-formula id="j_infor413_ineq_024"><alternatives>
<mml:math><mml:mi mathvariant="italic">k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math>
<tex-math><![CDATA[$k=1$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor413_ineq_025"><alternatives>
<mml:math><mml:mi mathvariant="italic">m</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math>
<tex-math><![CDATA[$m=5$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<graphic xlink:href="infor413_g004.jpg"/>
</fig>
<p>Second, we build a <italic>k</italic>-NN graph from a given dataset and analyse the modelled graph to compute the outlierness score for each object. Thus, our method provides users with an outlierness score of every object (Goal 2). It is noteworthy that multidimensional data and/or non-coordinate data can also be modelled as a graph. Therefore, by using such a strategy, we can relax the constraints on the input data types and/or forms (Goal 3).</p>
<p>As shown in Fig. <xref rid="j_infor413_fig_004">4</xref>, the proposed method has three steps to detect outliers.</p>
<list>
<list-item id="j_infor413_li_008">
<label>1.</label>
<p>Build a <italic>k</italic>-NN graph from a given dataset. Each object in the dataset is represented as a node, and the <italic>k</italic>-NN relationships connecting each object to its <italic>k</italic>-nearest objects are represented as <italic>directed</italic> edges. Each edge has the weight value of the similarity between the two nodes it connects.</p>
</list-item>
<list-item id="j_infor413_li_009">
<label>2.</label>
<p>For each object in the dataset, calculate the centrality and centre-proximity scores on the <italic>k</italic>-NN graph. Using both scores, compute the outlierness score of each node.</p>
</list-item>
<list-item id="j_infor413_li_010">
<label>3.</label>
<p>Return the top <italic>m</italic> objects with the highest outlierness scores as outliers.</p>
</list-item>
</list>
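<p>Step 1 above can be sketched in Python as follows (an illustrative sketch; the similarity weight 1/(1 + distance) is a hypothetical choice made only for this example, and the actual weight assignment is a separate design decision):</p>

```python
import math

def build_knn_graph(data, k):
    """Step 1: model the dataset as a directed k-NN graph.
    Returns {i: [(j, weight), ...]} where each object points to its
    k nearest objects; the weight is a similarity, here the
    hypothetical 1 / (1 + Euclidean distance)."""
    graph = {}
    for i, p in enumerate(data):
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: math.dist(p, data[j]))
        graph[i] = [(j, 1.0 / (1.0 + math.dist(p, data[j])))
                    for j in others[:k]]
    return graph
```

Note that every node has exactly <italic>k</italic> outgoing edges, while an outlier may receive no incoming edges at all.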
</sec>
<sec id="j_infor413_s_008">
<label>4</label>
<title>Centrality and Centre-Proximity</title>
<p>An object near the centre of a cluster is more likely to have many neighbouring objects than an outlier is. The distances between such an object and its neighbouring objects would also be smaller than the distances between an outlier and its neighbouring objects. We propose two novel location features representing the above properties, which differentiate the normal objects from the outlier objects: <italic>centrality</italic> and <italic>centre-proximity</italic>.</p>
<p>To calculate centrality and centre-proximity, we convert the given dataset into a graph as in existing RWR-based outlier detection methods. Although there are different ways to build a graph out of a given dataset, we use the <italic>k</italic>-NN graph for the conversion. We discuss different graph modelling schemes in Section <xref rid="j_infor413_s_009">5</xref>.</p>
<p>Centrality represents how much other objects in the dataset recognize a target object <italic>p</italic> as one of the core objects within the cluster. Centre-proximity represents how close a target object <italic>p</italic> is to the core objects within the cluster.</p>
<p>The centrality score can be measured by <italic>how many other objects recognize the target object p as their neighbour</italic>. The influence of each object that recognizes <italic>p</italic> as its neighbour is proportional to its centre-proximity and to how close it is to <italic>p</italic>. On the other hand, the centre-proximity score can be measured by <italic>how many other objects are recognized by p as its neighbour</italic>. The influence of each object recognized by <italic>p</italic> as its neighbour is also proportional to its centrality and to how close it is to <italic>p</italic>.</p>
<p>We compute the two scores by iteratively referring to each other. The centrality and centre-proximity of node <italic>p</italic> can be calculated in the <italic>i</italic>-th iteration as follows: <disp-formula-group id="j_infor413_dg_001">
<disp-formula id="j_infor413_eq_001">
<label>(1)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt"><mml:mtr><mml:mtd class="align-odd"/><mml:mtd class="align-even"><mml:msub><mml:mrow><mml:mi mathvariant="italic">Centrality</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mstyle displaystyle="true"><mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi mathvariant="italic">q</mml:mi><mml:mo stretchy="false">∈</mml:mo><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi mathvariant="italic">w</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">q</mml:mi><mml:mo stretchy="false">→</mml:mo><mml:mi mathvariant="italic">p</mml:mi></mml:mrow></mml:msub><mml:mo>∗</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">Centre</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">Proximity</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" 
stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo mathvariant="normal">,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[\begin{aligned}{}& {\mathit{Centrality}_{i}}(p)=\sum \limits_{q\in \mathit{In}(p)}{w_{q\to p}}\ast \frac{{\mathit{Centre}\text{-}\mathit{Proximity}_{i-1}}(q)}{{Z_{\mathit{Out}(q)}}},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor413_eq_002">
<label>(2)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt"><mml:mtr><mml:mtd class="align-odd"/><mml:mtd class="align-even"><mml:msub><mml:mrow><mml:mi mathvariant="italic">Centre</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">Proximity</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mstyle displaystyle="true"><mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi mathvariant="italic">q</mml:mi><mml:mo stretchy="false">∈</mml:mo><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi mathvariant="italic">w</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">p</mml:mi><mml:mo stretchy="false">→</mml:mo><mml:mi mathvariant="italic">q</mml:mi></mml:mrow></mml:msub><mml:mo>∗</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">Centrality</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" 
stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo mathvariant="normal">,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[\begin{aligned}{}& {\mathit{Centre}\text{-}\mathit{Proximity}_{i}}(p)=\sum \limits_{q\in \mathit{Out}(p)}{w_{p\to q}}\ast \frac{{\mathit{Centrality}_{i-1}}(q)}{{Z_{\mathit{In}(q)}}},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group> where <inline-formula id="j_infor413_ineq_026"><alternatives>
<mml:math><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math>
<tex-math><![CDATA[$\mathit{In}(p)$]]></tex-math></alternatives></inline-formula> is a set of objects that point to <italic>p</italic>, and <inline-formula id="j_infor413_ineq_027"><alternatives>
<mml:math><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math>
<tex-math><![CDATA[$\mathit{Out}(p)$]]></tex-math></alternatives></inline-formula> is a set of objects that <italic>p</italic> points to. <inline-formula id="j_infor413_ineq_028"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">w</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">p</mml:mi><mml:mo stretchy="false">→</mml:mo><mml:mi mathvariant="italic">q</mml:mi></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${w_{p\to q}}$]]></tex-math></alternatives></inline-formula> is the weight assigned to the edge from <italic>p</italic> to <italic>q</italic>. <inline-formula id="j_infor413_ineq_029"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${Z_{\mathit{Out}(q)}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_infor413_ineq_030"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${Z_{\mathit{In}(q)}}$]]></tex-math></alternatives></inline-formula> are normalization factors; <inline-formula id="j_infor413_ineq_031"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${Z_{\mathit{Out}(q)}}$]]></tex-math></alternatives></inline-formula> is the sum of all the weights assigned to the edges from <italic>q</italic> to <inline-formula id="j_infor413_ineq_032"><alternatives>
<mml:math><mml:mi mathvariant="italic">Out</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math>
<tex-math><![CDATA[$\mathit{Out}(q)$]]></tex-math></alternatives></inline-formula>, while <inline-formula id="j_infor413_ineq_033"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">Z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${Z_{\mathit{In}(q)}}$]]></tex-math></alternatives></inline-formula> is the sum of all weights assigned to the edges from <inline-formula id="j_infor413_ineq_034"><alternatives>
<mml:math><mml:mi mathvariant="italic">In</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">q</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math>
<tex-math><![CDATA[$\mathit{In}(q)$]]></tex-math></alternatives></inline-formula> to <italic>q</italic>.</p>
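<p>The two equations can be sketched in Python as follows (an illustrative sketch under our own assumptions: the uniform initial scores and the per-iteration L1 normalization are our choices, made so that the scores stay bounded; the graph is assumed to be a <italic>k</italic>-NN graph, so every node has at least one outgoing edge):</p>

```python
def centrality_centre_proximity(graph, n, max_iters=20):
    """Iterate equations (1) and (2) on a directed weighted graph given
    as {p: [(q, w_pq), ...]}.  Z_Out(q) and Z_In(q) are the
    normalization factors defined in the text."""
    in_edges = {p: [] for p in range(n)}
    for p, nbrs in graph.items():
        for q, w in nbrs:
            in_edges[q].append((p, w))
    z_out = {q: sum(w for _, w in graph[q]) for q in range(n)}
    z_in = {q: sum(w for _, w in in_edges[q]) for q in range(n)}
    cent = {p: 1.0 for p in range(n)}   # assumed uniform initial scores
    prox = {p: 1.0 for p in range(n)}
    for _ in range(max_iters):
        # eq. (1): centrality of p collects centre-proximity from In(p)
        new_cent = {p: sum(w * prox[q] / z_out[q] for q, w in in_edges[p])
                    for p in range(n)}
        # eq. (2): centre-proximity of p collects centrality from Out(p)
        new_prox = {p: sum(w * cent[q] / z_in[q] for q, w in graph[p])
                    for p in range(n)}
        # L1-normalize each score vector so values stay bounded (as in HITS)
        cs, ps = sum(new_cent.values()), sum(new_prox.values())
        cent = {p: v / cs for p, v in new_cent.items()}
        prox = {p: v / ps for p, v in new_prox.items()}
    return cent, prox
```

In a tiny 1-NN graph where node 2 is an outlier pointing into a two-node cluster, node 2 ends up with zero centrality (nothing points to it) but a small positive centre-proximity.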
<p>As the two equations above show, the centrality and centre-proximity scores have a <italic>mutual reinforcement relationship</italic> with each other. An object with a high centrality score increases the centre-proximity scores of the objects that regard it as their neighbour. Conversely, an object with a high centre-proximity score increases the centrality scores of the objects that it regards as its neighbours. This mutual reinforcement relationship between the centrality and centre-proximity scores is similar to that between the hub and authority scores in HITS (Kleinberg, <xref ref-type="bibr" rid="j_infor413_ref_019">1999</xref>).</p>
<p>In addition, the centrality and centre-proximity scores of an object influence its neighbouring objects in proportion to the weights on the edges. This means that an object has a larger influence on an object close to itself than on a distant one. This concept is unique to our approach and is not reflected in the mutual reinforcement of HITS.</p>
<p>The detailed procedure for calculating the two scores is shown in Fig. <xref rid="j_infor413_fig_005">5</xref>. The number of iterations (MAX_ITERATIONS) decides how far an object influences other objects in calculating the centrality and centre-proximity scores. If the influence of only directly connected neighbours is to be considered, we set MAX_ITERATIONS as 1. On the other hand, if the influence of all the objects in the dataset needs to be considered, we set MAX_ITERATIONS as the diameter of the modelled graph (Ha <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_013">2011</xref>).</p>
<fig id="j_infor413_fig_005">
<label>Fig. 5</label>
<caption>
<p>Procedure of calculating the outlierness scores.</p>
</caption>
<graphic xlink:href="infor413_g005.jpg"/>
</fig>
<p>By iteratively calculating the centrality and centre-proximity scores, each object gets the centrality and centre-proximity scores that are affected by its indirect neighbours. With sufficiently large MAX_ITERATIONS (i.e. MAX_ITERATIONS equal to or greater than the diameter of the graph), both scores are affected by all the objects in the dataset. This makes the proposed outlier detection method prevent both local density and micro-cluster problems.</p>
<p>The outliers get low centrality and centre-proximity scores compared to those of the normal objects. The fringe objects may have centrality scores as low as those of the outliers, but their centre-proximity scores are generally higher than those of the outliers. To fulfill Goal 1, we use <italic>the inverse of the converged centre-proximity score</italic> of each object as the final outlierness score. The outlierness score of node <italic>p</italic> can be calculated as follows: 
<disp-formula id="j_infor413_eq_003">
<label>(3)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="italic">Outlierness</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">ConvergedˍCentre</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">Proximity</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">p</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[ \mathit{Outlierness}(p)=\frac{1}{\mathit{Converged\_Centre}\text{-}\mathit{Proximity}(p)}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
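<p>Equation (3), together with the top-<italic>m</italic> selection of step 3 in the overview, can be sketched as follows (a minimal sketch; the input is assumed to be a mapping from objects to their converged centre-proximity scores):</p>

```python
def outlierness(converged_prox):
    """Eq. (3): the outlierness score of each object is the inverse
    of its converged centre-proximity score."""
    return {p: 1.0 / cp for p, cp in converged_prox.items()}

def top_m_outliers(converged_prox, m):
    """Step 3: return the m objects with the highest outlierness scores."""
    scores = outlierness(converged_prox)
    return sorted(scores, key=scores.get, reverse=True)[:m]
```

With the centre-proximity values of the fringe object <italic>d</italic> (0.313) and the outlier <italic>a</italic> (0.128) from the Fig. 6 example (and a hypothetical value for <italic>c</italic>), <italic>a</italic> is correctly returned as the top outlier.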
<p>Figure <xref rid="j_infor413_fig_006">6</xref> shows the centrality and centre-proximity scores assigned to each object on a 2-NN graph where edge <inline-formula id="j_infor413_ineq_035"><alternatives>
<mml:math><mml:mi mathvariant="italic">x</mml:mi><mml:mo stretchy="false">→</mml:mo><mml:mi mathvariant="italic">y</mml:mi></mml:math>
<tex-math><![CDATA[$x\to y$]]></tex-math></alternatives></inline-formula> indicates that <italic>x</italic> regards <italic>y</italic> as its neighbour and the thickness of each edge indicates how close the two nodes are. The object <italic>c</italic> is closest to the core of the cluster, so it has the highest centrality and centre-proximity scores. The fringe object <italic>d</italic> and the outlier object <italic>a</italic> have no objects that recognize them as the cluster centre, so their centrality scores are both 0. However, <italic>d</italic> has a higher centre-proximity score (= 0.313) than <italic>a</italic> (= 0.128) because it is closer to the cluster centre. As a result, we are able to identify <italic>a</italic> as an outlier by using the inverse of the centre-proximity score as the outlierness score.</p>
<fig id="j_infor413_fig_006">
<label>Fig. 6</label>
<caption>
<p>Centrality and centre-proximity scores assigned to objects in a 2-NN graph.</p>
</caption>
<graphic xlink:href="infor413_g006.jpg"/>
</fig>
<p>The RWR score, the location feature in the RWR-based outlier detection method, considers (1) how many objects point to an object, and (2) how many objects exist around the object. Therefore, the RWR score is similar in concept to the centrality score, and thus it cannot differentiate the fringe objects from the outlier objects.</p>
<p>The proposed method has an <inline-formula id="j_infor413_ineq_036"><alternatives>
<mml:math><mml:mi mathvariant="italic">O</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">E</mml:mi><mml:mo>∗</mml:mo><mml:mi mathvariant="italic">i</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math>
<tex-math><![CDATA[$O(E\ast i)$]]></tex-math></alternatives></inline-formula> time complexity for calculating the outlierness score of each object. Here, <italic>E</italic> is the total number of edges in the modelled graph, and <italic>i</italic> is the number of iterations for calculating the centrality and centre-proximity scores until they converge.</p>
</sec>
<sec id="j_infor413_s_009">
<label>5</label>
<title>Graph Modelling</title>
<p>We model a graph based on the given dataset to easily calculate the centrality and centre-proximity scores. The neighbour relationship between two objects can be represented as an edge in the graph. In Section <xref rid="j_infor413_s_010">5.1</xref>, we introduce three graph modelling schemes for modelling a given dataset, and then analyse the effect of each scheme on our outlier detection method.</p>
<p>The centrality and centre-proximity scores of an object influence its neighbouring objects in proportion to the weights on the edges. The weight should be high if two objects have similar characteristics (i.e. are located close to each other) and low if they are different (i.e. are located far apart). The details of the weight assignment methods are discussed in Section <xref rid="j_infor413_s_014">5.2</xref>.</p>
<sec id="j_infor413_s_010">
<label>5.1</label>
<title>Graph Models</title>
<p>In this section, we analyse three graph modelling schemes for our proposed outlier detection: (1) a complete graph, (2) an <italic>e</italic>-NN graph, and (3) a <italic>k</italic>-NN graph. The three schemes all represent an object as a node and differ only in the way they connect nodes with edges. Note that a given dataset could also be modelled with other graph schemes such as minimum spanning trees, <italic>k</italic>-NN graph variations, and <italic>e</italic>-<italic>k</italic>-NN graphs. However, we employ these three representative (basic) schemes because their characteristics differ substantially from each other.</p>
<sec id="j_infor413_s_011">
<label>5.1.1</label>
<title>Complete Graph</title>
<p>A complete graph connects each node in the dataset to every other node with a directed edge. Due to this, the centrality and centre-proximity scores of an object are directly affected by all the other objects in the graph.</p>
<p>In the complete graph, the centrality and centre-proximity scores of each object differ only according to the weight values. In particular, the weights of the edges connected to objects located at the centre of gravity of the whole graph have the largest values. Therefore, the objects located at the centre of gravity have the highest centrality and centre-proximity scores, regardless of whether they are outliers. Figure <xref rid="j_infor413_fig_007">7</xref> shows a 2-dimensional dataset. The dataset includes two clusters (<inline-formula id="j_infor413_ineq_037"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_infor413_ineq_038"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula>) and an outlier <inline-formula id="j_infor413_ineq_039"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula>. Suppose that we model a dataset as a complete graph. In this graph, the outlier <inline-formula id="j_infor413_ineq_040"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> has the highest centrality and centre-proximity scores because it is located at the centre of gravity in the graph. As a result, <inline-formula id="j_infor413_ineq_041"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> is not detected as an outlier in this case.</p>
<fig id="j_infor413_fig_007">
<label>Fig. 7</label>
<caption>
<p>An example showing the problem with a complete graph.</p>
</caption>
<graphic xlink:href="infor413_g007.jpg"/>
</fig>
<p>Additionally, in the complete graph, the number of edges grows quadratically with the total number of objects in the dataset, and therefore the performance of computing the centrality and centre-proximity scores could deteriorate significantly.</p>
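<p>For comparison with the <italic>k</italic>-NN construction, a complete weighted graph can be built as follows (an illustrative sketch reusing the hypothetical 1/(1 + distance) similarity weight); note that the number of directed edges is <italic>n</italic>(<italic>n</italic> − 1):</p>

```python
import math

def build_complete_graph(data):
    """Model the dataset as a complete weighted digraph: every node
    points to every other node, so the edge count is n*(n-1),
    i.e. quadratic in the dataset size."""
    n = len(data)
    return {i: [(j, 1.0 / (1.0 + math.dist(data[i], data[j])))
                for j in range(n) if j != i]
            for i in range(n)}
```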
</sec>
<sec id="j_infor413_s_012">
<label>5.1.2</label>
<title><italic>e</italic>-NN Graph</title>
<p>An <italic>e</italic>-NN graph modelling scheme connects an object with other objects that exist within a specific distance denoted as <italic>e</italic> as shown in Fig. <xref rid="j_infor413_fig_008">8</xref>. Due to this, the <italic>e</italic>-NN graphs for the same dataset could differ greatly when the value of <italic>e</italic> changes. In Fig. <xref rid="j_infor413_fig_008">8</xref>, if <italic>e</italic> is set to a value smaller than <inline-formula id="j_infor413_ineq_042"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">e</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${e_{1}}$]]></tex-math></alternatives></inline-formula>, all of the objects that should be included in the cluster <inline-formula id="j_infor413_ineq_043"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{2}}$]]></tex-math></alternatives></inline-formula> are separated from one another and have very low centrality and centre-proximity scores. On the other hand, if the value of <italic>e</italic> is set larger than <inline-formula id="j_infor413_ineq_044"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">e</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${e_{2}}$]]></tex-math></alternatives></inline-formula>, the outlier <inline-formula id="j_infor413_ineq_045"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${o_{1}}$]]></tex-math></alternatives></inline-formula> is included in cluster <inline-formula id="j_infor413_ineq_046"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${c_{1}}$]]></tex-math></alternatives></inline-formula> and has a high centre-proximity score. This is similar to the local density problem that occurs in the distance-based outlier detection method.</p>
<fig id="j_infor413_fig_008">
<label>Fig. 8</label>
<caption>
<p>An example of the local density problem with an <italic>e</italic>-NN graph.</p>
</caption>
<graphic xlink:href="infor413_g008.jpg"/>
</fig>
<p>As a result, in the <italic>e</italic>-NN graph, the precision of the outlier detection changes considerably when <italic>e</italic> changes. Therefore, the user should understand the distribution of the dataset very well in order to set the proper value for <italic>e</italic>, which is quite difficult in practice.</p>
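<p>As an informal illustration (ours, not part of the original paper), the sensitivity of the <italic>e</italic>-NN graph to <italic>e</italic> can be sketched as follows; the point coordinates and threshold values are made up for the example.</p>

```python
# Illustrative sketch: building an e-NN graph and observing how strongly
# its connectivity depends on the threshold e.
import math

def enn_graph(points, e):
    """Connect every pair of points whose Euclidean distance is at most e."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= e:
                edges.append((i, j))
    return edges

# A close pair plus one far-away point: a small e isolates the far point,
# while a large e connects everything into one component.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(len(enn_graph(pts, 0.5)))   # 1 edge: only the close pair is connected
print(len(enn_graph(pts, 10.0)))  # 3 edges: all pairs are connected
```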
</sec>
<sec id="j_infor413_s_013">
<label>5.1.3</label>
<title><italic>k</italic>-NN Graph</title>
<p>A <italic>k</italic>-NN graph modelling scheme connects each object to its <italic>k</italic> nearest objects with a directed edge as shown in Fig. <xref rid="j_infor413_fig_009">9</xref>. Therefore, in the <italic>k</italic>-NN graph, each object directly affects its nearest <italic>k</italic> objects.</p>
<fig id="j_infor413_fig_009">
<label>Fig. 9</label>
<caption>
<p>An example of a distinguishable outlier problem with a directed <italic>k</italic>-NN graph (<inline-formula id="j_infor413_ineq_047"><alternatives>
<mml:math><mml:mi mathvariant="italic">k</mml:mi><mml:mo>=</mml:mo><mml:mn>3</mml:mn></mml:math>
<tex-math><![CDATA[$k=3$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<graphic xlink:href="infor413_g009.jpg"/>
</fig>
<p>In the <italic>k</italic>-NN graph, the out-degrees of all objects are identical, but the in-degree of an object varies depending on how many other objects regard it as one of their <italic>k</italic> nearest neighbours. Objects near the core of a cluster have high in-degrees, and thus high centrality and centre-proximity scores. Outliers and fringe objects have small in-degrees, and thus low centrality scores. However, fringe objects have more out-edges pointing to objects with high centre-proximity than outliers do, and thus show higher centre-proximity scores than outliers. Therefore, it is possible to differentiate between outliers and fringe objects with the <italic>k</italic>-NN graph.</p>
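<p>The in-degree behaviour described above can be sketched with a toy example (ours, not the paper's); the coordinates and the choice <italic>k</italic> = 2 are purely illustrative.</p>

```python
# Illustrative sketch: a directed k-NN graph and the in-degree of each object.
# Core objects are pointed at by many neighbours; an isolated outlier still
# points at k objects, but no object points back at it.
import math

def knn_graph(points, k):
    """Return directed edges (i -> j) from each point to its k nearest points."""
    edges = []
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        edges.extend((i, j) for j in neighbours)
    return edges

# Four clustered objects and one distant outlier (index 4).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
edges = knn_graph(pts, k=2)
in_degree = [sum(1 for _, j in edges if j == v) for v in range(len(pts))]
print(in_degree)  # the outlier's in-degree is 0: no object picks it as a neighbour
```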
<p>The precision of outlier detection using the <italic>k</italic>-NN graph may change as the parameter <italic>k</italic> changes. When <italic>k</italic> is very small, objects are sparsely connected and may not clearly form clusters; as a result, the precision of detecting outliers decreases. As <italic>k</italic> increases, the clusters become clearly formed, and thus the precision increases. However, when <italic>k</italic> is very large, all the objects in the dataset may be recognized as a single cluster, yielding a result similar to that of the complete graph.</p>
<p>Nevertheless, the <italic>k</italic>-NN graph scheme shows relatively small fluctuation in the precision of outlier detection when the value of <italic>k</italic> changes, compared to the <italic>e</italic>-NN graph scheme that forms a very different resulting graph based on the distribution of the dataset. This is because the <italic>k</italic>-NN graph scheme <italic>equally connects k nearest neighbours for each object</italic>. In this paper, through extensive experiments, we show that modelling a given dataset as a <italic>k</italic>-NN graph has a higher precision than the other two graph modelling schemes in outlier detection and that the precision is less affected by the change in <italic>k</italic>.</p>
</sec>
</sec>
<sec id="j_infor413_s_014">
<label>5.2</label>
<title>Weight Assignment</title>
<p>A weight on an edge represents the similarity between two objects connected by the edge. In this paper, the weight is calculated via the following two methods.</p>
<sec id="j_infor413_s_015">
<label>5.2.1</label>
<title>Euclidean Similarity</title>
<p>The Euclidean similarity is the inverse notion of the Euclidean distance: the smaller the distance between two objects, the higher their similarity. The Euclidean similarity between two objects <italic>a</italic> and <italic>b</italic> is computed by the following equation (Tan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_027">2005</xref>). 
<disp-formula id="j_infor413_eq_004">
<label>(4)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="italic">Euclidean</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">Similarity</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">a</mml:mi><mml:mo mathvariant="normal">,</mml:mo><mml:mi mathvariant="italic">b</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>−</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mo largeop="false" movablelimits="false">∑</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub><mml:mo>−</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">b</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt><mml:mo>−</mml:mo><mml:mo movablelimits="false">min</mml:mo></mml:mrow><mml:mrow><mml:mo movablelimits="false">max</mml:mo><mml:mo>−</mml:mo><mml:mo movablelimits="false">min</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo mathvariant="normal">,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[ \mathit{Euclidean}\text{-}\mathit{Similarity}(a,b)=1-\frac{\sqrt{{\textstyle\textstyle\sum _{i=1}^{d}}{({a_{i}}-{b_{i}})^{2}}}-\min }{\max -\min },\]]]></tex-math></alternatives>
</disp-formula> 
where each element <inline-formula id="j_infor413_ineq_048"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">e</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${e_{i}}$]]></tex-math></alternatives></inline-formula> in the vector represents the <italic>i</italic>-th attribute value of the object, <italic>d</italic> is the number of attribute values (dimension) in the object, max indicates the Euclidean distance between two objects that are farthest apart in the dataset, and min indicates the Euclidean distance between two objects that are located closest in the dataset.</p>
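<p>A minimal sketch of equation (<xref rid="j_infor413_eq_004">4</xref>) in code (ours, with made-up coordinates): the pairwise Euclidean distance is shifted by min and rescaled by max − min, so the closest pair in the dataset gets similarity 1 and the farthest pair gets similarity 0.</p>

```python
# Illustrative implementation of Eq. (4): min-max-normalized Euclidean
# distance turned into a similarity in [0, 1].
import math
from itertools import combinations

def euclidean_similarity(a, b, d_min, d_max):
    return 1.0 - (math.dist(a, b) - d_min) / (d_max - d_min)

pts = [(0.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
dists = [math.dist(p, q) for p, q in combinations(pts, 2)]
d_min, d_max = min(dists), max(dists)  # here: 3.0 and 5.0

print(euclidean_similarity(pts[0], pts[1], d_min, d_max))  # closest pair -> 1.0
print(euclidean_similarity(pts[0], pts[2], d_min, d_max))  # farthest pair -> 0.0
```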
</sec>
<sec id="j_infor413_s_016">
<label>5.2.2</label>
<title>Cosine Similarity</title>
<p>Each object can be represented as a vector in the Euclidean space. Here, each element <inline-formula id="j_infor413_ineq_049"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">e</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${e_{i}}$]]></tex-math></alternatives></inline-formula> in the vector represents the <italic>i</italic>-th attribute value of the object and <italic>d</italic> is the number of attribute values (dimension) in the object. The cosine similarity between two objects <italic>a</italic> and <italic>b</italic> is the cosine value between two corresponding vectors (Tan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_027">2005</xref>). 
<disp-formula id="j_infor413_eq_005">
<label>(5)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mi mathvariant="italic">Cosine</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">Similarity</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">a</mml:mi><mml:mo mathvariant="normal">,</mml:mo><mml:mi mathvariant="italic">b</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo largeop="false" movablelimits="false">∑</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub><mml:mo>×</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">b</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mo largeop="false" movablelimits="false">∑</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt><mml:mo>×</mml:mo><mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mo largeop="false" movablelimits="false">∑</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mi mathvariant="italic">b</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[ \mathit{Cosine}\text{-}\mathit{Similarity}(a,b)=\frac{{\textstyle\sum _{i=1}^{d}}{a_{i}}\times {b_{i}}}{\sqrt{{\textstyle\sum _{i=1}^{d}}{a_{i}^{2}}}\times \sqrt{{\textstyle\sum _{i=1}^{d}}{b_{i}^{2}}}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
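<p>A minimal sketch of equation (<xref rid="j_infor413_eq_005">5</xref>) in code (ours, with made-up vectors). It also demonstrates the weakness discussed later in the experiments: the cosine similarity depends only on the direction of the vectors, so two distant but collinear objects still receive a weight close to 1.</p>

```python
# Illustrative implementation of Eq. (5): cosine of the angle between the
# attribute vectors of two objects.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine_similarity((1.0, 0.0), (1.0, 0.0)))      # same direction -> 1.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))      # orthogonal -> 0.0
print(cosine_similarity((1.0, 1.0), (100.0, 100.0)))  # distant but collinear -> ~1.0
```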
<p>In this paper, we conduct a series of experiments to compare the precision of outlier detection with different weight assignment methods and to identify the superior one.</p>
</sec>
</sec>
</sec>
<sec id="j_infor413_s_017">
<label>6</label>
<title>Performance Evaluation</title>
<p>In this section, we measure the performance of our proposed method. First, we evaluate the quality of found outliers through a simple toy experiment. Second, we examine the change of the precision with our method while changing parameter values to show the robustness of our method. Third, we compare the precision and the execution time of our method with those of previous outlier detection methods. Finally, we show some interesting results when applying our method to a real-world dataset.</p>
<p>It is noteworthy that we choose <italic>k</italic> neighbours randomly if an object has more than <italic>k</italic> neighbours at the same distance (each interior point has four neighbours: up, down, left, and right in Fig. <xref rid="j_infor413_fig_009">9</xref>). However, in real-world data, it is very rare that an object has more than <italic>k</italic> neighbours at exactly the same distance. In fact, in our experiments, there were no such objects. We conducted each experiment several times to remove random effects, and the precision of every run was exactly the same.</p>
<sec id="j_infor413_s_018">
<label>6.1</label>
<title>Environment for Experiments</title>
<fig id="j_infor413_fig_010">
<label>Fig. 10</label>
<caption>
<p>Chameleon datasets (Karypis <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_017">1999</xref>).</p>
</caption>
<graphic xlink:href="infor413_g010.jpg"/>
</fig>
<p>For experiments, we used (1) the four well-known 2-dimensional synthetic datasets used in Chameleon (Karypis <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_017">1999</xref>) and (2) one real-world dataset composed of the records of NBA players (Basketball-Reference.com). The Chameleon datasets are composed of 8,000, 8,000, 8,000, and 10,000 objects, respectively, as shown in Fig. <xref rid="j_infor413_fig_010">10</xref>. The NBA dataset is composed of the statistics of 585 basketball players, with 23 attributes each, from the 2004–2005 NBA regular season.</p>
<p>We used the <italic>precision</italic>, <italic>recall</italic>, and the <italic>execution time</italic> as our evaluation metrics. The precision and recall are calculated as follows: <disp-formula-group id="j_infor413_dg_002">
<disp-formula id="j_infor413_eq_006">
<label>(6)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt"><mml:mtr><mml:mtd class="align-odd"/><mml:mtd class="align-even"><mml:mi mathvariant="italic">Precision</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">#</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">ground</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">truth</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi><mml:mo>∩</mml:mo><mml:mi mathvariant="italic">predicted</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">#</mml:mi><mml:mi mathvariant="italic">predicted</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo mathvariant="normal">,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[\begin{aligned}{}& \mathit{Precision}=\frac{\mathrm{\# }(\mathit{ground}\text{-}\mathit{truth}\hspace{2.5pt}\mathit{outliers}\cap \mathit{predicted}\hspace{2.5pt}\mathit{outliers})}{\mathrm{\# }\mathit{predicted}\hspace{2.5pt}\mathit{outliers}},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor413_eq_007">
<label>(7)</label><alternatives>
<mml:math display="block"><mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt"><mml:mtr><mml:mtd class="align-odd"/><mml:mtd class="align-even"><mml:mi mathvariant="italic">Recall</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac><mml:mrow><mml:mi mathvariant="normal">#</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mi mathvariant="italic">ground</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">truth</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi><mml:mo>∩</mml:mo><mml:mi mathvariant="italic">predicted</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi><mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">#</mml:mi><mml:mi mathvariant="italic">ground</mml:mi><mml:mtext>-</mml:mtext><mml:mi mathvariant="italic">truth</mml:mi><mml:mspace width="2.5pt"/><mml:mi mathvariant="italic">outliers</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<tex-math><![CDATA[\[\begin{aligned}{}& \mathit{Recall}=\frac{\mathrm{\# }(\mathit{ground}\text{-}\mathit{truth}\hspace{2.5pt}\mathit{outliers}\cap \mathit{predicted}\hspace{2.5pt}\mathit{outliers})}{\mathrm{\# }\mathit{ground}\text{-}\mathit{truth}\hspace{2.5pt}\mathit{outliers}}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group> In the following experiments, we set the number of predicted outliers equal to the number of ground-truth outliers, which makes the precision and recall identical in all cases (i.e. <italic>Precision</italic> = <italic>Recall</italic>). Thus, we only report the precision. To build the ground-truth labels, we asked five human experts to decide whether each object is an outlier, and used their majority decision as the ground truth. The numbers of outliers chosen from the Chameleon datasets are 328, 803, 1,163, and 945, respectively. In Fig. <xref rid="j_infor413_fig_010">10</xref>, the normal objects are shown as crosses (green) and the outliers as squares (red).</p>
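<p>The metrics in equations (<xref rid="j_infor413_eq_006">6</xref>) and (<xref rid="j_infor413_eq_007">7</xref>) can be computed as follows (our sketch, with hypothetical outlier id sets); fixing the number of predicted outliers to the number of ground-truth outliers makes the two values coincide.</p>

```python
# Illustrative sketch of Eqs. (6)-(7): set-based precision and recall.
def precision(ground_truth, predicted):
    return len(set(ground_truth) & set(predicted)) / len(predicted)

def recall(ground_truth, predicted):
    return len(set(ground_truth) & set(predicted)) / len(ground_truth)

truth = {1, 5, 9, 12}   # hypothetical ground-truth outlier ids
pred = {1, 5, 9, 40}    # same number of predictions as ground-truth outliers
print(precision(truth, pred), recall(truth, pred))  # 0.75 0.75
```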
<p>All the experiments performed in this paper were executed on Intel i7 920 with 16 GB RAM running Windows 7, and the programs were coded in C#.</p>
</sec>
<sec id="j_infor413_s_019">
<label>6.2</label>
<title>Qualitative Analysis</title>
<p>We first conducted a toy experiment to show that our proposed method resolves the problems mentioned earlier. For this purpose, we synthetically generated a small 2-dimensional dataset containing two clusters (each with fringe objects and different densities), a micro-cluster, and a few randomly generated outlier objects. The dataset is composed of 266 objects and 29 outliers.</p>
<p>Figure <xref rid="j_infor413_fig_011">11</xref> shows the dataset and the result of our proposed method on it. We used <inline-formula id="j_infor413_ineq_050"><alternatives>
<mml:math><mml:mi mathvariant="italic">k</mml:mi><mml:mo>=</mml:mo><mml:mn>10</mml:mn></mml:math>
<tex-math><![CDATA[$k=10$]]></tex-math></alternatives></inline-formula> and the Euclidean similarity as a similarity measure. In Fig. <xref rid="j_infor413_fig_011">11</xref>, the objects marked with the cross (green colour) are normal objects, and the ones marked with the square (red colour) are the outliers detected by the proposed method. The results show that our proposed method does not suffer from (a) the local density problem and (b) the micro-cluster problem, and (c) can differentiate between fringe objects and outliers.</p>
<fig id="j_infor413_fig_011">
<label>Fig. 11</label>
<caption>
<p>Result of the proposed outlier detection method.</p>
</caption>
<graphic xlink:href="infor413_g011.jpg"/>
</fig>
<fig id="j_infor413_fig_012">
<label>Fig. 12</label>
<caption>
<p>Precision with varying graph modelling schemes.</p>
</caption>
<graphic xlink:href="infor413_g012.jpg"/>
</fig>
<p>We also detected outliers using the <italic>k</italic>-Dist (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>) (distance-based) and LOF (Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>) (density-based) methods. As expected, all three problems mentioned above occurred with the <italic>k</italic>-Dist method, and all but the local density problem occurred with the LOF method.</p>
</sec>
<sec id="j_infor413_s_020">
<label>6.3</label>
<title>Analysis on the Proposed Outlier Detection Method</title>
<p>In this subsection, we used the four Chameleon datasets to evaluate the precision of our method while changing (1) graph modelling schemes, (2) weight assignment methods, and (3) the number of iterations (MAX_ITERATIONS).</p>
<sec id="j_infor413_s_021">
<label>6.3.1</label>
<title>Analysis on Graph Modelling Schemes</title>
<p>In this experiment, we compare the three different graph modelling schemes discussed in Section <xref rid="j_infor413_s_010">5.1</xref>: (1) the complete graph, (2) the <italic>e</italic>-NN graph, and (3) the <italic>k</italic>-NN graph. We used the Chameleon datasets to build graphs with each modelling scheme. After building the graphs, we detected outliers using our proposed method while varying the parameter <italic>e</italic> of the <italic>e</italic>-NN graph from 0.5% to 10%, in steps of 0.5%, of the farthest distance between two objects in the dataset. Similarly, the parameter <italic>k</italic> of the <italic>k</italic>-NN graph was varied from 0.5% to 10%, in steps of 0.5%, of the total number of objects in the dataset.</p>
<p>A weight on each edge was assigned using the Euclidean similarity, and the number of outliers to be detected was set as equal to the number of outliers selected by humans in each dataset.</p>
<p>Figure <xref rid="j_infor413_fig_012">12</xref> shows the results. The <italic>x</italic>-axis indicates the parameter value of each modelling scheme and the <italic>y</italic>-axis shows the precision. The complete graph shows the lowest precision. In the case of the <italic>e</italic>-NN graph, the precision fluctuates greatly as <italic>e</italic> changes. The <italic>k</italic>-NN graph, compared to the other graph modelling schemes, shows the highest precision and a much smaller fluctuation of the precision with the parameter change than the <italic>e</italic>-NN graph.</p>
</sec>
<sec id="j_infor413_s_022">
<label>6.3.2</label>
<title>Analysis on Weight Assignment Methods</title>
<p>In this experiment, we first modelled the four Chameleon datasets as <italic>k</italic>-NN graphs and assigned edge weights using (1) the Euclidean similarity and (2) the cosine similarity. Then, we measured the average precision of our method with a varying parameter <italic>k</italic> in the <italic>k</italic>-NN graph from 0.5% to 10% in steps of 0.5% of the total number of objects in the dataset. The number of outliers to be detected was set as equal to the number of outliers selected by humans in each dataset.</p>
<p>Figure <xref rid="j_infor413_fig_013">13</xref> shows the results. The Euclidean similarity method shows superior precision to the cosine similarity method. In the case of the cosine similarity, even for two distant objects, if the angle between their corresponding vectors is close to 0, a high weight is assigned to the edge connecting them. This could prevent our method from accurately detecting outliers.</p>
<fig id="j_infor413_fig_013">
<label>Fig. 13</label>
<caption>
<p>Precision with varying weight assignment methods.</p>
</caption>
<graphic xlink:href="infor413_g013.jpg"/>
</fig>
</sec>
<sec id="j_infor413_s_023">
<label>6.3.3</label>
<title>Analysis on the Numbers of Iterations</title>
<p>In this experiment, we measured the average precision of our method while changing the number of iterations (MAX_ITERATIONS) from 1 to 15 for calculating the centrality and centre-proximity scores. The four Chameleon datasets were modelled as <italic>k</italic>-NN graphs and the edge weights were assigned by the Euclidean similarity method. Here, <italic>k</italic> was set to the value providing the best precision for each dataset, as shown in Fig. <xref rid="j_infor413_fig_012">12</xref>. Also, in order to measure the precision, the number of outliers to be detected was set equal to the number of true outliers identified by humans in each dataset.</p>
<p>Figure <xref rid="j_infor413_fig_014">14</xref> shows the results. The precision increases as the number of iterations increases. Once the number of iterations exceeds 5 or 6, both the centrality and centre-proximity scores of all the objects converge and the precision no longer changes. This indicates that the actual execution time of outlier detection by our method is not that large.</p>
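<p>The convergence behaviour can be sketched with a generic score-propagation loop (ours; the weight matrix and the update rule below are placeholders, not the paper's actual centrality/centre-proximity formulas): iteration stops as soon as the largest score change falls below a tolerance, or after MAX_ITERATIONS.</p>

```python
# Generic convergence sketch: scores are propagated along weighted edges and
# the loop stops once they no longer change noticeably. The matrix W and the
# update rule are illustrative placeholders, not the paper's exact formulas.
MAX_ITERATIONS = 15
TOL = 1e-3

# Hypothetical 3-node graph: W[i][j] is the weight of the edge i -> j
# (each row sums to 1, so the total score mass is preserved).
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]

n = len(W)
scores = [1.0, 0.0, 0.0]  # deliberately non-uniform start
for it in range(1, MAX_ITERATIONS + 1):
    new = [sum(W[i][j] * scores[i] for i in range(n)) for j in range(n)]
    delta = max(abs(a - b) for a, b in zip(new, scores))
    scores = new
    if delta < TOL:
        break

print(it, scores)  # converges well before MAX_ITERATIONS on this toy graph
```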
<fig id="j_infor413_fig_014">
<label>Fig. 14</label>
<caption>
<p>Precision with a varying number of iterations.</p>
</caption>
<graphic xlink:href="infor413_g014.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_infor413_s_024">
<label>6.4</label>
<title>Comparisons with Other Methods</title>
<p>In this experiment, we evaluated the average precision and the average execution time of five different outlier detection methods: <italic>k</italic>-Dist (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>) (distance-based), LOF (Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>) (density-based), Outrank-a and Outrank-b (RWR-based), and the proposed method. For this evaluation, we used the four Chameleon datasets. For each method, we selected the parameter values (e.g. <italic>k</italic> for the <italic>k</italic>-NN graph), among multiple candidate values, that provide the best precision.</p>
<sec id="j_infor413_s_025">
<label>6.4.1</label>
<title>Precision</title>
<p>In this experiment, we used the Chameleon datasets to show the effectiveness of our proposed method in terms of average precision. We compare our method to the existing outlier detection methods. The parameters of the existing methods were set to the values from their papers that showed <italic>the best precision for each dataset</italic> (Knorr <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_021">2000</xref>; Moonesinghe and Tan, <xref ref-type="bibr" rid="j_infor413_ref_022">2008</xref>; Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>; Breunig <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_006">2000</xref>).</p>
<p>Table <xref rid="j_infor413_tab_001">1</xref> shows the results. Our proposed method, which takes account of the distribution of all the objects, showed the best average precision. The density-based outlier detection method (LOF), which only takes account of the influence of the neighbouring objects, showed a higher precision than the distance-based outlier detection method (<italic>k</italic>-Dist), which only takes account of the influence of the target object itself. The Outrank methods use a concept similar to the proposed method in terms of modelling a given dataset as a graph. However, they have problems in their location features and in their graph modelling schemes, which lead to very low precision.</p>
<table-wrap id="j_infor413_tab_001">
<label>Table 1</label>
<caption>
<p>Precision of five different methods.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>k</italic>-Dist</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LOF</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Outrank-a</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Outrank-b</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Proposed method</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">data1</td>
<td style="vertical-align: top; text-align: left">0.70</td>
<td style="vertical-align: top; text-align: left">0.85</td>
<td style="vertical-align: top; text-align: left">0.07</td>
<td style="vertical-align: top; text-align: left">0.09</td>
<td style="vertical-align: top; text-align: left"><bold>0.88</bold></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data2</td>
<td style="vertical-align: top; text-align: left"><bold>0.88</bold></td>
<td style="vertical-align: top; text-align: left">0.85</td>
<td style="vertical-align: top; text-align: left">0.13</td>
<td style="vertical-align: top; text-align: left">0.23</td>
<td style="vertical-align: top; text-align: left">0.86</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data3</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">0.12</td>
<td style="vertical-align: top; text-align: left">0.17</td>
<td style="vertical-align: top; text-align: left"><bold>0.95</bold></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data4</td>
<td style="vertical-align: top; text-align: left">0.91</td>
<td style="vertical-align: top; text-align: left">0.89</td>
<td style="vertical-align: top; text-align: left">0.18</td>
<td style="vertical-align: top; text-align: left">0.15</td>
<td style="vertical-align: top; text-align: left"><bold>0.91</bold></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Average</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.86</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.88</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.13</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.16</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>0.90</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor413_s_026">
<label>6.4.2</label>
<title>Execution Time</title>
<p>In this experiment, we measured the average execution time of five different outlier detection methods over the four Chameleon datasets. To remove randomness effects, we executed them five times and calculated the average execution times. Here, the parameters of each outlier detection method were set to the values that showed the best precision for each dataset.</p>
<p>Table <xref rid="j_infor413_tab_002">2</xref> shows the results in milliseconds (<italic>ms</italic>). The proposed method, which takes account of the characteristics of all the objects and shows the highest precision, does not show any big difference in execution time compared to the other outlier detection methods.</p>
<p>In the case of the Outrank-a method, the given dataset is modelled as a <italic>complete graph</italic>; thus, it requires a lot of time to detect the outliers. In the case of the Outrank-b method, computing <italic>the sharing neighbourhood similarity measure</italic> (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>), which assigns the edge weights, takes a huge amount of time; thus, its overall execution time is very large.</p>
<table-wrap id="j_infor413_tab_002">
<label>Table 2</label>
<caption>
<p>Execution time of five different methods (<inline-formula id="j_infor413_ineq_051"><alternatives>
<mml:math><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">s</mml:mi></mml:math>
<tex-math><![CDATA[$ms$]]></tex-math></alternatives></inline-formula>).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>k</italic>-Dist</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">LOF</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Outrank-a</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Outrank-b</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Proposed method</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">data1</td>
<td style="vertical-align: top; text-align: left">56,447</td>
<td style="vertical-align: top; text-align: left"><bold>52,510</bold></td>
<td style="vertical-align: top; text-align: left">408,352</td>
<td style="vertical-align: top; text-align: left">9,730,417</td>
<td style="vertical-align: top; text-align: left">56,515</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data2</td>
<td style="vertical-align: top; text-align: left">55,913</td>
<td style="vertical-align: top; text-align: left"><bold>54,718</bold></td>
<td style="vertical-align: top; text-align: left">402,726</td>
<td style="vertical-align: top; text-align: left">7,951,859</td>
<td style="vertical-align: top; text-align: left">57,029</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data3</td>
<td style="vertical-align: top; text-align: left">56,054</td>
<td style="vertical-align: top; text-align: left"><bold>54,409</bold></td>
<td style="vertical-align: top; text-align: left">403,579</td>
<td style="vertical-align: top; text-align: left">7,836,638</td>
<td style="vertical-align: top; text-align: left">57,214</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">data4</td>
<td style="vertical-align: top; text-align: left"><bold>86,974</bold></td>
<td style="vertical-align: top; text-align: left">89,522</td>
<td style="vertical-align: top; text-align: left">673,850</td>
<td style="vertical-align: top; text-align: left">19,130,638</td>
<td style="vertical-align: top; text-align: left">87,340</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Average</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">63,847</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>62,790</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">472,127</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">11,162,388</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">64,525</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="j_infor413_s_027">
<label>6.5</label>
<title>Detecting Outliers from a Real-World Dataset</title>
<p>In this experiment, we applied the proposed outlier detection method to a real-world dataset composed of the statistics of 585 basketball players in the 2004–2005 NBA regular season. From the NBA dataset, we derived the attributes shown in Table <xref rid="j_infor413_tab_003">3</xref>, using the formulas provided by ESPN (ESPN Fantasy Basketball). We normalized each attribute so that an attribute with a very large domain does not dominate the others. The attribute <italic>a</italic> of object <italic>v</italic> was normalized using the formula <inline-formula id="j_infor413_ineq_052"><alternatives>
<mml:math><mml:mstyle displaystyle="false"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">v</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow></mml:msub><mml:mo>−</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow><mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:math>
<tex-math><![CDATA[$\frac{{v_{a}}-\bar{a}}{{\sigma _{a}}}$]]></tex-math></alternatives></inline-formula> (Ramaswamy <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor413_ref_025">2000</xref>). Here, <inline-formula id="j_infor413_ineq_053"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">v</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${v_{a}}$]]></tex-math></alternatives></inline-formula> is the attribute <italic>a</italic> of an object <italic>v</italic>, <inline-formula id="j_infor413_ineq_054"><alternatives>
<mml:math><mml:mover accent="true"><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow><mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math>
<tex-math><![CDATA[$\bar{a}$]]></tex-math></alternatives></inline-formula> is the average of attribute <italic>a</italic>, and <inline-formula id="j_infor413_ineq_055"><alternatives>
<mml:math><mml:msub><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">a</mml:mi></mml:mrow></mml:msub></mml:math>
<tex-math><![CDATA[${\sigma _{a}}$]]></tex-math></alternatives></inline-formula> is the standard deviation of attribute <italic>a</italic>.</p>
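<p>The normalization above is a standard z-score; a minimal Python sketch (function and variable names are ours, for illustration only):</p>

```python
import statistics

def z_normalize(values):
    """Normalize a list of attribute values to z-scores: (v_a - mean) / stdev."""
    mean = statistics.fmean(values)    # average of attribute a
    stdev = statistics.pstdev(values)  # population standard deviation of a
    return [(v - mean) / stdev for v in values]
```

<p>Applied column-wise to the dataset, this places every derived attribute on the same scale, so no single attribute dominates the others.</p>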
<table-wrap id="j_infor413_tab_003">
<label>Table 3</label>
<caption>
<p>Derived attributes from the NBA 2004 dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Derived attributes</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Description</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Formula</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">FG%</td>
<td style="vertical-align: top; text-align: left">Field goal percentage</td>
<td style="vertical-align: top; text-align: left">#field goals made / #field goals attempted</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">FT%</td>
<td style="vertical-align: top; text-align: left">Free throw percentage</td>
<td style="vertical-align: top; text-align: left">#free throws made / #free throws attempted</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3P%</td>
<td style="vertical-align: top; text-align: left">Three-point percentage</td>
<td style="vertical-align: top; text-align: left">#three-point goals made / #three-point goals attempted</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">ASTPG</td>
<td style="vertical-align: top; text-align: left">#assists per game</td>
<td style="vertical-align: top; text-align: left">#assists / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">BLKPG</td>
<td style="vertical-align: top; text-align: left">#blocks per game</td>
<td style="vertical-align: top; text-align: left">#blocks / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MINPG</td>
<td style="vertical-align: top; text-align: left">Minutes per game</td>
<td style="vertical-align: top; text-align: left">Minutes played / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">PTSPG</td>
<td style="vertical-align: top; text-align: left">Points per game</td>
<td style="vertical-align: top; text-align: left">#points / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">REBPG</td>
<td style="vertical-align: top; text-align: left">#rebounds per game</td>
<td style="vertical-align: top; text-align: left">#rebounds / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">STLPG</td>
<td style="vertical-align: top; text-align: left">#steals per game</td>
<td style="vertical-align: top; text-align: left">#steals / #played games</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">TOPG</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">#turnovers per game</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">#turnovers / #played games</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We set <inline-formula id="j_infor413_ineq_056"><alternatives>
<mml:math><mml:mi mathvariant="italic">k</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math>
<tex-math><![CDATA[$k=5$]]></tex-math></alternatives></inline-formula> for the NBA dataset in this experiment because, as shown in Fig. <xref rid="j_infor413_fig_012">12</xref>, high accuracy is observed when <italic>k</italic> is set to about 1% of the number of objects in the dataset; the NBA dataset contains 585 players. The 15 players with the highest outlierness scores were detected as outliers; the result is provided in Table <xref rid="j_infor413_tab_004">4</xref>. Note that we excluded the players whose ‘Minutes Played’ was below the average of the 2004 NBA players.</p>
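<p>The selection step can be sketched as follows (a hypothetical Python sketch; the outlierness scores are assumed to be already computed by the proposed method, and the 1% heuristic for <italic>k</italic> follows the observation above):</p>

```python
def top_outliers(scores, data_size, top_n=15):
    """Pick k as roughly 1% of the dataset size and return the top_n
    object ids ranked by decreasing outlierness score."""
    # 1% of 585 players is 5.85, which the experiment truncates to k = 5
    k = max(1, int(0.01 * data_size))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return k, ranked[:top_n]
```

<p>Here <italic>scores</italic> maps each player to an outlierness score; filtering out players with below-average ‘Minutes Played’ would be applied to the ranked list afterwards.</p>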
<table-wrap id="j_infor413_tab_004">
<label>Table 4</label>
<caption>
<p>Top 15 outlier 2004 NBA players.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">fname</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">lname</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">team</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">gp</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">min</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">FG%</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">FT%</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">3P%</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">ASTPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">BLKPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">MINPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">PTSPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">REBPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">STLPG</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">TOPG</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Steve</bold></td>
<td style="vertical-align: top; text-align: left"><bold>Nash</bold></td>
<td style="vertical-align: top; text-align: left">PHO</td>
<td style="vertical-align: top; text-align: left">75</td>
<td style="vertical-align: top; text-align: left">2,573</td>
<td style="vertical-align: top; text-align: left">0.78</td>
<td style="vertical-align: top; text-align: left">0.94</td>
<td style="vertical-align: top; text-align: left">1.14</td>
<td style="vertical-align: top; text-align: left"><bold>5.61</bold></td>
<td style="vertical-align: top; text-align: left">−0.63</td>
<td style="vertical-align: top; text-align: left">1.38</td>
<td style="vertical-align: top; text-align: left">1.31</td>
<td style="vertical-align: top; text-align: left">−0.05</td>
<td style="vertical-align: top; text-align: left">0.79</td>
<td style="vertical-align: top; text-align: left">2.73</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Andrei</bold></td>
<td style="vertical-align: top; text-align: left"><bold>Kirilenko</bold></td>
<td style="vertical-align: top; text-align: left">UTA</td>
<td style="vertical-align: top; text-align: left">41</td>
<td style="vertical-align: top; text-align: left">1,349</td>
<td style="vertical-align: top; text-align: left">0.69</td>
<td style="vertical-align: top; text-align: left">0.44</td>
<td style="vertical-align: top; text-align: left">0.43</td>
<td style="vertical-align: top; text-align: left">0.85</td>
<td style="vertical-align: top; text-align: left"><bold>6.00</bold></td>
<td style="vertical-align: top; text-align: left">1.24</td>
<td style="vertical-align: top; text-align: left">1.32</td>
<td style="vertical-align: top; text-align: left">1.16</td>
<td style="vertical-align: top; text-align: left">2.23</td>
<td style="vertical-align: top; text-align: left">1.34</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Allen</bold></td>
<td style="vertical-align: top; text-align: left"><bold>Iverson</bold></td>
<td style="vertical-align: top; text-align: left">PHI</td>
<td style="vertical-align: top; text-align: left">75</td>
<td style="vertical-align: top; text-align: left">3,174</td>
<td style="vertical-align: top; text-align: left">−0.01</td>
<td style="vertical-align: top; text-align: left">0.69</td>
<td style="vertical-align: top; text-align: left">0.48</td>
<td style="vertical-align: top; text-align: left">3.59</td>
<td style="vertical-align: top; text-align: left">−0.55</td>
<td style="vertical-align: top; text-align: left">2.15</td>
<td style="vertical-align: top; text-align: left">3.86</td>
<td style="vertical-align: top; text-align: left">0.23</td>
<td style="vertical-align: top; text-align: left">3.93</td>
<td style="vertical-align: top; text-align: left">4.44</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Jimmy</td>
<td style="vertical-align: top; text-align: left">Jackson</td>
<td style="vertical-align: top; text-align: left">PHO</td>
<td style="vertical-align: top; text-align: left">40</td>
<td style="vertical-align: top; text-align: left">997</td>
<td style="vertical-align: top; text-align: left">0.100</td>
<td style="vertical-align: top; text-align: left">1.30</td>
<td style="vertical-align: top; text-align: left">1.30</td>
<td style="vertical-align: top; text-align: left">0.40</td>
<td style="vertical-align: top; text-align: left">−0.59</td>
<td style="vertical-align: top; text-align: left">0.46</td>
<td style="vertical-align: top; text-align: left">0.17</td>
<td style="vertical-align: top; text-align: left">0.18</td>
<td style="vertical-align: top; text-align: left">−0.73</td>
<td style="vertical-align: top; text-align: left">0.41</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Larry</bold></td>
<td style="vertical-align: top; text-align: left"><bold>Hughes</bold></td>
<td style="vertical-align: top; text-align: left">WAS</td>
<td style="vertical-align: top; text-align: left">61</td>
<td style="vertical-align: top; text-align: left">2,358</td>
<td style="vertical-align: top; text-align: left">0.05</td>
<td style="vertical-align: top; text-align: left">0.41</td>
<td style="vertical-align: top; text-align: left">0.34</td>
<td style="vertical-align: top; text-align: left">1.69</td>
<td style="vertical-align: top; text-align: left">−0.19</td>
<td style="vertical-align: top; text-align: left">1.80</td>
<td style="vertical-align: top; text-align: left">2.41</td>
<td style="vertical-align: top; text-align: left">1.18</td>
<td style="vertical-align: top; text-align: left">5.00</td>
<td style="vertical-align: top; text-align: left">1.75</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Shaun</td>
<td style="vertical-align: top; text-align: left">Livingston</td>
<td style="vertical-align: top; text-align: left">LAC</td>
<td style="vertical-align: top; text-align: left">30</td>
<td style="vertical-align: top; text-align: left">812</td>
<td style="vertical-align: top; text-align: left">−0.12</td>
<td style="vertical-align: top; text-align: left">0.25</td>
<td style="vertical-align: top; text-align: left">−1.19</td>
<td style="vertical-align: top; text-align: left">1.90</td>
<td style="vertical-align: top; text-align: left">−0.04</td>
<td style="vertical-align: top; text-align: left">0.67</td>
<td style="vertical-align: top; text-align: left">−0.06</td>
<td style="vertical-align: top; text-align: left">−0.19</td>
<td style="vertical-align: top; text-align: left">0.97</td>
<td style="vertical-align: top; text-align: left">1.74</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Ruben</td>
<td style="vertical-align: top; text-align: left">Patterson</td>
<td style="vertical-align: top; text-align: left">POR</td>
<td style="vertical-align: top; text-align: left">70</td>
<td style="vertical-align: top; text-align: left">1,957</td>
<td style="vertical-align: top; text-align: left">1.08</td>
<td style="vertical-align: top; text-align: left">−0.46</td>
<td style="vertical-align: top; text-align: left">−0.76</td>
<td style="vertical-align: top; text-align: left">0.14</td>
<td style="vertical-align: top; text-align: left">−0.18</td>
<td style="vertical-align: top; text-align: left">0.76</td>
<td style="vertical-align: top; text-align: left">0.64</td>
<td style="vertical-align: top; text-align: left">0.20</td>
<td style="vertical-align: top; text-align: left">1.96</td>
<td style="vertical-align: top; text-align: left">1.14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Greg</td>
<td style="vertical-align: top; text-align: left">Buckner</td>
<td style="vertical-align: top; text-align: left">DEN</td>
<td style="vertical-align: top; text-align: left">70</td>
<td style="vertical-align: top; text-align: left">1,522</td>
<td style="vertical-align: top; text-align: left">1.05</td>
<td style="vertical-align: top; text-align: left">0.41</td>
<td style="vertical-align: top; text-align: left">1.00</td>
<td style="vertical-align: top; text-align: left">0.09</td>
<td style="vertical-align: top; text-align: left">−0.62</td>
<td style="vertical-align: top; text-align: left">0.16</td>
<td style="vertical-align: top; text-align: left">−0.27</td>
<td style="vertical-align: top; text-align: left">−0.19</td>
<td style="vertical-align: top; text-align: left">0.98</td>
<td style="vertical-align: top; text-align: left">−0.60</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Reggie</td>
<td style="vertical-align: top; text-align: left">Evans</td>
<td style="vertical-align: top; text-align: left">SEA</td>
<td style="vertical-align: top; text-align: left">79</td>
<td style="vertical-align: top; text-align: left">1,881</td>
<td style="vertical-align: top; text-align: left">0.52</td>
<td style="vertical-align: top; text-align: left">−0.78</td>
<td style="vertical-align: top; text-align: left">−1.19</td>
<td style="vertical-align: top; text-align: left">−0.58</td>
<td style="vertical-align: top; text-align: left">−0.40</td>
<td style="vertical-align: top; text-align: left">0.36</td>
<td style="vertical-align: top; text-align: left">−0.49</td>
<td style="vertical-align: top; text-align: left">2.45</td>
<td style="vertical-align: top; text-align: left">0.24</td>
<td style="vertical-align: top; text-align: left">0.20</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Ben</td>
<td style="vertical-align: top; text-align: left">Wallace</td>
<td style="vertical-align: top; text-align: left">DET</td>
<td style="vertical-align: top; text-align: left">74</td>
<td style="vertical-align: top; text-align: left">2,672</td>
<td style="vertical-align: top; text-align: left">0.29</td>
<td style="vertical-align: top; text-align: left">−1.3</td>
<td style="vertical-align: top; text-align: left">−0.59</td>
<td style="vertical-align: top; text-align: left">−0.04</td>
<td style="vertical-align: top; text-align: left">4.08</td>
<td style="vertical-align: top; text-align: left">1.55</td>
<td style="vertical-align: top; text-align: left">0.33</td>
<td style="vertical-align: top; text-align: left">3.65</td>
<td style="vertical-align: top; text-align: left">1.78</td>
<td style="vertical-align: top; text-align: left">−0.10</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Ervin</td>
<td style="vertical-align: top; text-align: left">Johnson</td>
<td style="vertical-align: top; text-align: left">MIN</td>
<td style="vertical-align: top; text-align: left">46</td>
<td style="vertical-align: top; text-align: left">410</td>
<td style="vertical-align: top; text-align: left">0.96</td>
<td style="vertical-align: top; text-align: left">−0.26</td>
<td style="vertical-align: top; text-align: left">4.22</td>
<td style="vertical-align: top; text-align: left">−0.92</td>
<td style="vertical-align: top; text-align: left">−0.21</td>
<td style="vertical-align: top; text-align: left">−1.09</td>
<td style="vertical-align: top; text-align: left">−1.04</td>
<td style="vertical-align: top; text-align: left">−0.41</td>
<td style="vertical-align: top; text-align: left">−1.05</td>
<td style="vertical-align: top; text-align: left">−1.00</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Kevin</bold></td>
<td style="vertical-align: top; text-align: left"><bold>Garnett</bold></td>
<td style="vertical-align: top; text-align: left">MIN</td>
<td style="vertical-align: top; text-align: left">82</td>
<td style="vertical-align: top; text-align: left">3,121</td>
<td style="vertical-align: top; text-align: left">0.79</td>
<td style="vertical-align: top; text-align: left">0.57</td>
<td style="vertical-align: top; text-align: left">0.11</td>
<td style="vertical-align: top; text-align: left">2.27</td>
<td style="vertical-align: top; text-align: left">2.00</td>
<td style="vertical-align: top; text-align: left">1.74</td>
<td style="vertical-align: top; text-align: left">2.43</td>
<td style="vertical-align: top; text-align: left">4.20</td>
<td style="vertical-align: top; text-align: left">1.88</td>
<td style="vertical-align: top; text-align: left">2.01</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Eddie</td>
<td style="vertical-align: top; text-align: left">Griffin</td>
<td style="vertical-align: top; text-align: left">MIN</td>
<td style="vertical-align: top; text-align: left">70</td>
<td style="vertical-align: top; text-align: left">1,492</td>
<td style="vertical-align: top; text-align: left">−0.39</td>
<td style="vertical-align: top; text-align: left">0.12</td>
<td style="vertical-align: top; text-align: left">0.59</td>
<td style="vertical-align: top; text-align: left">−0.56</td>
<td style="vertical-align: top; text-align: left">2.66</td>
<td style="vertical-align: top; text-align: left">0.11</td>
<td style="vertical-align: top; text-align: left">−0.04</td>
<td style="vertical-align: top; text-align: left">1.27</td>
<td style="vertical-align: top; text-align: left">−0.66</td>
<td style="vertical-align: top; text-align: left">−0.49</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Darius</td>
<td style="vertical-align: top; text-align: left">Miles</td>
<td style="vertical-align: top; text-align: left">POR</td>
<td style="vertical-align: top; text-align: left">63</td>
<td style="vertical-align: top; text-align: left">1,698</td>
<td style="vertical-align: top; text-align: left">0.58</td>
<td style="vertical-align: top; text-align: left">−0.46</td>
<td style="vertical-align: top; text-align: left">0.69</td>
<td style="vertical-align: top; text-align: left">0.18</td>
<td style="vertical-align: top; text-align: left">1.74</td>
<td style="vertical-align: top; text-align: left">0.66</td>
<td style="vertical-align: top; text-align: left">0.85</td>
<td style="vertical-align: top; text-align: left">0.55</td>
<td style="vertical-align: top; text-align: left">1.25</td>
<td style="vertical-align: top; text-align: left">1.73</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Anthony</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Carter</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">MIN</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">66</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">742</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.04</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.55</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.41</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.23</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.86</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.85</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.99</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.22</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">−0.29</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The outliers detected in the NBA dataset include not only outstanding players but also players with poor statistics. Steve Nash, the 3rd outlier, has an assists-per-game value 5.61 standard deviations above the average of the other players. Andrei Kirilenko, the 4th outlier, has a blocks-per-game value 6.00 standard deviations above the average. Additionally, the best players, such as Allen Iverson (top player in points), Larry Hughes (top player in steals), and Kevin Garnett (top player in rebounds), were all included among the detected outliers.<xref ref-type="fn" rid="j_infor413_fn_002">2</xref><fn id="j_infor413_fn_002"><label><sup>2</sup></label>
<p><uri>http://espn.go.com/nba/statistics/_/year/2005</uri>.</p></fn> The other players detected as outliers either have poor statistics or played in only a few games.</p>
</sec>
</sec>
<sec id="j_infor413_s_028">
<label>7</label>
<title>Conclusions and Future Work</title>
<p>Our extensive analysis of existing outlier detection methods revealed significant shortcomings that can lead to incorrect interpretation of experimental results obtained on different datasets. The aim of our research was therefore to propose a novel method with better effectiveness and performance. In this paper, we have proposed a novel graph-based outlier detection method built on the notions of centrality and centre-proximity, and presented extensive and promising experimental results.</p>
<p>The main contributions of the research presented in this paper can be summarized as follows: (1) We have proposed two novel relative location features, centrality and centre-proximity, both of which take account of the characteristics of <italic>all the objects</italic> in the dataset. (2) To solve the local density and micro-cluster problems, we have proposed a novel graph-based outlier detection method. (3) We have explored multiple graph modelling schemes and weighting schemes for our graph-based outlier detection. (4) The effectiveness and efficiency of the proposed method have been verified in extensive experiments with both synthetic and real-world datasets; comparisons with previous methods have been discussed and the advantages of our approach specified.</p>
<p>Building on the promising results obtained in our experiments, we plan to conduct additional experiments and extend our comparative evaluations to more sophisticated competitors (e.g. neural networks and probabilistic methods) and more datasets, including benchmark datasets.</p>
</sec>
</body>
<back>
<ref-list id="j_infor413_reflist_001">
<title>References</title>
<ref id="j_infor413_ref_001">
<mixed-citation publication-type="chapter"><string-name><surname>Akoglu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>McGlohon</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Faloutsos</surname>, <given-names>C.</given-names></string-name> (<year>2010</year>). <chapter-title>OddBall: spotting anomalies in weighted graphs</chapter-title>. In: <source>Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining</source>, pp. <fpage>410</fpage>–<lpage>421</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Bae</surname>, <given-names>D.-H.</given-names></string-name>, <string-name><surname>Jeong</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>S.-W.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>M.</given-names></string-name> (<year>2012</year>). <chapter-title>Outlier detection using centrality and center-proximity</chapter-title>. In: <source>Proceedings of the 21st ACM International Conference on Information and Knowledge Management</source>, pp. <fpage>2251</fpage>–<lpage>2254</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_003">
<mixed-citation publication-type="book"><string-name><surname>Barnett</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Lewis</surname>, <given-names>T.</given-names></string-name> (<year>1994</year>). <source>Outliers in Statistical Data</source>. <publisher-name>John Wiley &amp; Sons</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Bay</surname>, <given-names>S.D.</given-names></string-name>, <string-name><surname>Schwabacher</surname>, <given-names>M.</given-names></string-name> (<year>2003</year>). <chapter-title>Mining distance-based outliers in near linear time with randomization and a simple pruning rule</chapter-title>. In: <source>Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>, pp. <fpage>29</fpage>–<lpage>38</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_005">
<mixed-citation publication-type="chapter"><string-name><surname>Böhm</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Haegler</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Müller</surname>, <given-names>N.S.</given-names></string-name>, <string-name><surname>Plant</surname>, <given-names>C.</given-names></string-name> (<year>2009</year>). <chapter-title>CoCo: coding cost for parameter-free outlier detection</chapter-title>. In: <source>Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>, pp. <fpage>149</fpage>–<lpage>158</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Breunig</surname>, <given-names>M.M.</given-names></string-name>, <string-name><surname>Kriegel</surname>, <given-names>H.P.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>R.T.</given-names></string-name>, <string-name><surname>Sander</surname>, <given-names>J.</given-names></string-name> (<year>2000</year>). <chapter-title>LOF: identifying density-based local outliers</chapter-title>. In: <source>Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data</source>, pp. <fpage>93</fpage>–<lpage>104</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_007">
<mixed-citation publication-type="chapter"><string-name><surname>Brin</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Page</surname>, <given-names>L.</given-names></string-name> (<year>1998</year>). <chapter-title>The anatomy of a large-scale hypertextual Web search engine</chapter-title>. In: <source>Proceedings of the 7th International World Wide Web Conference</source>, pp. <fpage>107</fpage>–<lpage>117</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Chan</surname>, <given-names>K.Y.</given-names></string-name>, <string-name><surname>Kwong</surname>, <given-names>C.K.</given-names></string-name>, <string-name><surname>Fogarty</surname>, <given-names>T.C.</given-names></string-name> (<year>2010</year>). <article-title>Modeling manufacturing processes using a genetic programming-based fuzzy regression with detection of outliers</article-title>. <source>Information Sciences</source>, <volume>180</volume>(<issue>4</issue>), <fpage>506</fpage>–<lpage>518</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Chandola</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Banerjee</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>V.</given-names></string-name> (<year>2009</year>). <article-title>Anomaly detection: a survey</article-title>. <source>ACM Computing Surveys</source>, <volume>41</volume>(<issue>3</issue>), <fpage>15:1</fpage>–<lpage>15:58</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Domingues</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Filippone</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Michiardi</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Zouaoui</surname>, <given-names>J.</given-names></string-name> (<year>2018</year>). <article-title>A comparative evaluation of outlier detection algorithms: experiments and analyses</article-title>. <source>Pattern Recognition</source>, <volume>74</volume>, <fpage>406</fpage>–<lpage>421</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_011">
<mixed-citation publication-type="journal"><string-name><surname>Fanaee-T</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Gama</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <article-title>Tensor-based anomaly detection: an interdisciplinary survey</article-title>. <source>Knowledge-Based Systems</source>, <volume>98</volume>, <fpage>130</fpage>–<lpage>147</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Last</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Makover</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Kandel</surname>, <given-names>A.</given-names></string-name> (<year>2007</year>). <article-title>Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology</article-title>. <source>Information Sciences</source>, <volume>177</volume>(<issue>2</issue>), <fpage>467</fpage>–<lpage>475</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Ha</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bae</surname>, <given-names>D.-H.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>S.-W.</given-names></string-name>, <string-name><surname>Baek</surname>, <given-names>S.C.</given-names></string-name>, <string-name><surname>Jeong</surname>, <given-names>B.S.</given-names></string-name> (<year>2011</year>). <chapter-title>Analyzing a Korean blogosphere: a social network analysis perspective</chapter-title>. In: <source>Proceedings of the 2011 ACM Symposium on Applied Computing</source>, pp. <fpage>773</fpage>–<lpage>777</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_014">
<mixed-citation publication-type="book"><string-name><surname>Han</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kamber</surname>, <given-names>M.</given-names></string-name> (<year>2000</year>). <source>Data Mining: Concepts and Techniques</source>. <publisher-name>Morgan Kaufmann</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_015">
<mixed-citation publication-type="book"><string-name><surname>Hawkins</surname>, <given-names>D.M.</given-names></string-name> (<year>1980</year>). <source>Identification of Outliers</source>. <publisher-name>Chapman and Hall</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_016">
<mixed-citation publication-type="journal"><string-name><surname>Huang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Feng</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <article-title>A non-parameter outlier detection algorithm based on natural neighbor</article-title>. <source>Knowledge-Based Systems</source>, <volume>92</volume>, <fpage>71</fpage>–<lpage>77</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Karypis</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>E.H.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>V.</given-names></string-name> (<year>1999</year>). <article-title>Chameleon: hierarchical clustering using dynamic modeling</article-title>. <source>IEEE Computer</source>, <volume>32</volume>(<issue>8</issue>), <fpage>68</fpage>–<lpage>75</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_018">
<mixed-citation publication-type="chapter"><string-name><surname>Kieu</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Jensen</surname>, <given-names>C.S.</given-names></string-name> (<year>2018</year>). <chapter-title>Outlier detection for multidimensional time series using deep neural networks</chapter-title>. In: <source>Proceedings of the 19th IEEE International Conference on Mobile Data Management</source>, pp. <fpage>125</fpage>–<lpage>134</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Kleinberg</surname>, <given-names>J.M.</given-names></string-name> (<year>1999</year>). <article-title>Authoritative sources in a hyperlinked environment</article-title>. <source>Journal of the ACM</source>, <volume>46</volume>(<issue>5</issue>), <fpage>604</fpage>–<lpage>632</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_020">
<mixed-citation publication-type="chapter"><string-name><surname>Knorr</surname>, <given-names>E.M.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>R.T.</given-names></string-name> (<year>1999</year>). <chapter-title>Finding intensional knowledge of distance-based outliers</chapter-title>. In: <source>Proceedings of 25th International Conference on Very Large Data Bases</source>, pp. <fpage>211</fpage>–<lpage>222</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_021">
<mixed-citation publication-type="journal"><string-name><surname>Knorr</surname>, <given-names>E.M.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>R.T.</given-names></string-name>, <string-name><surname>Tucakov</surname>, <given-names>V.</given-names></string-name> (<year>2000</year>). <article-title>Distance-based outliers: algorithms and applications</article-title>. <source>The VLDB Journal</source>, <volume>8</volume>(<issue>3–4</issue>), <fpage>237</fpage>–<lpage>253</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>Moonesinghe</surname>, <given-names>H.D.K.</given-names></string-name>, <string-name><surname>Tan</surname>, <given-names>P.N.</given-names></string-name> (<year>2008</year>). <article-title>OutRank: a graph-based outlier detection framework using random walk</article-title>. <source>International Journal on Artificial Intelligence Tools</source>, <volume>17</volume>(<issue>1</issue>), <fpage>19</fpage>–<lpage>36</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Na</surname>, <given-names>G.S.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <chapter-title>DILOF: effective and memory efficient local outlier detection in data streams</chapter-title>. In: <source>Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>, pp. <fpage>1993</fpage>–<lpage>2002</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_024">
<mixed-citation publication-type="chapter"><string-name><surname>Papadimitriou</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kitagawa</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Gibbons</surname>, <given-names>P.B.</given-names></string-name>, <string-name><surname>Faloutsos</surname>, <given-names>C.</given-names></string-name> (<year>2003</year>). <chapter-title>LOCI: fast outlier detection using the local correlation integral</chapter-title>. In: <source>Proceedings of the 19th International Conference on Data Engineering</source>, pp. <fpage>315</fpage>–<lpage>326</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Ramaswamy</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Rastogi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Shim</surname>, <given-names>K.</given-names></string-name> (<year>2000</year>). <article-title>Efficient algorithms for mining outliers from large data sets</article-title>. <source>SIGMOD Record</source>, <volume>29</volume>(<issue>2</issue>), <fpage>427</fpage>–<lpage>438</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_026">
<mixed-citation publication-type="journal"><string-name><surname>Song</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Takakura</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Okabe</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Nakao</surname>, <given-names>K.</given-names></string-name> (<year>2013</year>). <article-title>Toward a more practical unsupervised anomaly detection system</article-title>. <source>Information Sciences</source>, <volume>231</volume>, <fpage>4</fpage>–<lpage>14</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_027">
<mixed-citation publication-type="book"><string-name><surname>Tan</surname>, <given-names>P.N.</given-names></string-name>, <string-name><surname>Steinbach</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>V.</given-names></string-name> (<year>2005</year>). <source>Introduction to Data Mining</source>. <publisher-name>Addison-Wesley</publisher-name>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_028">
<mixed-citation publication-type="chapter"><string-name><surname>Widyantoro</surname>, <given-names>D.H.</given-names></string-name>, <string-name><surname>Ioerger</surname>, <given-names>T.R.</given-names></string-name>, <string-name><surname>Yen</surname>, <given-names>J.</given-names></string-name> (<year>2002</year>). <chapter-title>An incremental approach to building a cluster hierarchy</chapter-title>. In: <source>Proceedings of the 2002 IEEE International Conference on Data Mining</source>, pp. <fpage>705</fpage>–<lpage>708</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Yerlikaya-Özkurt</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Askan</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Weber</surname>, <given-names>G.W.</given-names></string-name> (<year>2016</year>). <article-title>A hybrid computational method based on convex optimization for outlier problems: application to earthquake ground motion prediction</article-title>. <source>Informatica</source>, <volume>27</volume>(<issue>4</issue>), <fpage>893</fpage>–<lpage>910</lpage>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_030">
<mixed-citation publication-type="other"><source>Basketball-Reference.com</source>. Available from: <uri>http://www.basketball-reference.com/</uri>.</mixed-citation>
</ref>
<ref id="j_infor413_ref_031">
<mixed-citation publication-type="other"><source>ESPN Fantasy Basketball</source>. Available from: <uri>http://www.espn.com/fantasy/basketball/</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>