Informatica


ADvaNCE – Efficient and Scalable Approximate Density-Based Clustering Based on Hashing
Volume 28, Issue 1 (2017), pp. 105–130
Tianrun Li   Thomas Heinis   Wayne Luk  

https://doi.org/10.15388/Informatica.2017.122
Pub. online: 1 January 2017      Type: Research Article      Open Access

Received
1 September 2016
Accepted
1 February 2017
Published
1 January 2017

Abstract

Analysing massive amounts of data and extracting value from them has become key across different disciplines. As the amounts of data grow rapidly, current approaches to data analysis are no longer efficient. This is particularly true for clustering algorithms, where distance calculations between pairs of points dominate the overall execution time: the more points a dataset contains, the larger the share of time spent on distance calculations.
Crucially, however, the data analysis and clustering process is rarely straightforward: parameters first need to be determined and tuned. Entirely accurate results are thus rarely needed, and we can instead sacrifice a little precision in the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach to approximating DBSCAN. More specifically, we propose two measures to reduce the distance calculation overhead and thereby approximate DBSCAN: (1) locality-sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations.
Our experiments show that the resulting clustering algorithm scales better than the state of the art as datasets grow. Compared with the most recent approximation technique for DBSCAN, our approach is in general one order of magnitude faster (up to 30× in our experiments) as the size of the dataset increases.
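To make measure (1) concrete, the sketch below illustrates how p-stable locality-sensitive hashing (in the style of Datar et al., 2004) can restrict the exact ε-distance checks at the core of DBSCAN to points that collide in every hash bucket, instead of comparing against all n points. This is a minimal, hypothetical illustration, not the paper's implementation: the function names, the single hash table, the bucket width w = 2ε and the number of hash functions are all assumptions made for the example.

```python
import math
import random

def lsh_buckets(points, w, num_hashes=4, seed=0):
    """Hash each point with num_hashes random Gaussian projections;
    points that land in the same bucket on every projection collide."""
    rng = random.Random(seed)
    dim = len(points[0])
    dirs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
    offs = [rng.uniform(0.0, w) for _ in range(num_hashes)]
    buckets = {}
    for idx, p in enumerate(points):
        key = tuple(
            math.floor((sum(a * x for a, x in zip(d, p)) + b) / w)
            for d, b in zip(dirs, offs)
        )
        buckets.setdefault(key, []).append(idx)
    return buckets

def approx_neighbourhoods(points, eps):
    """Approximate the eps-neighbourhood query of DBSCAN: exact distances
    are only computed inside shared buckets, so true neighbours that
    straddle a bucket boundary may be missed (the approximation), but no
    point farther than eps is ever reported as a neighbour."""
    buckets = lsh_buckets(points, w=2.0 * eps)
    neigh = {i: set() for i in range(len(points))}
    for members in buckets.values():
        for i in members:
            for j in members:
                if i < j and math.dist(points[i], points[j]) <= eps:
                    neigh[i].add(j)
                    neigh[j].add(i)
    return neigh

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.1)]
neighbourhoods = approx_neighbourhoods(pts, eps=0.5)
```

A full implementation would use several independent hash tables to reduce the false negatives caused by boundary-straddling points; measure (2) would additionally replace dense buckets by a few representative points before the exact distance checks are run.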


Biographies

Li Tianrun

T. Li is currently pursuing a PhD in databases and data mining at the University of Wisconsin.

Heinis Thomas
t.heinis@imperial.ac.uk

T. Heinis is currently a lecturer at Imperial College London. His group's research focuses on data management in general, and on scientific databases, high-performance data analytics and approximate computing in particular.

Luk Wayne

W. Luk is a full-time professor at Imperial College London. He leads the Custom Computing Research Group at Imperial, which focuses on research into reconfigurable hardware. He is a fellow of the Royal Academy of Engineering, the IEEE and the British Computer Society. He is also a senior advisor to Maxeler and an Honorary Fellowship Advisor of the Croucher Foundation.



Copyright
© 2017 Vilnius University
Open access article under the CC BY license.

Keywords
analytics, clustering, high-dimensional data, approximate computing


INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952
  • Copyright © 2023 Vilnius University
