Informatica


ADvaNCE – Efficient and Scalable Approximate Density-Based Clustering Based on Hashing
Volume 28, Issue 1 (2017), pp. 105–130
Tianrun Li   Thomas Heinis   Wayne Luk  

https://doi.org/10.15388/Informatica.2017.122
Pub. online: 1 January 2017      Type: Research Article      Open Access

Received
1 September 2016
Accepted
1 February 2017
Published
1 January 2017

Abstract

Analysing massive amounts of data and extracting value from them has become key across different disciplines. As the amounts of data grow rapidly, current approaches to data analysis are no longer efficient. This is particularly true for clustering algorithms, where distance calculations between pairs of points dominate the overall execution time: the more points a dataset contains, the larger the share of time spent on distance calculations.
Crucially, however, the data analysis and clustering process is rarely straightforward: parameters first need to be determined and tuned. Entirely accurate results are thus rarely needed, and we can instead sacrifice a little precision in the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach to approximating DBSCAN. More specifically, we propose two measures to reduce the distance calculation overhead and thereby approximate DBSCAN: (1) locality-sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations.
Our experiments show that the resulting clustering algorithm scales better than the state of the art as datasets grow. Compared with the most recent approximation technique for DBSCAN, our approach is in general one order of magnitude faster (up to 30× in our experiments) as the size of the dataset increases.
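To make measure (1) concrete, the sketch below illustrates how p-stable locality-sensitive hashing (in the style of Datar et al., 2004) can restrict the exact ε-distance checks at the core of DBSCAN to points that collide in every hash bucket, instead of comparing against all n points. This is a minimal, hypothetical illustration, not the paper's implementation: the function names, the single hash table, the bucket width w = 2ε and the number of hash functions are all assumptions made for the example.

```python
import math
import random

def lsh_buckets(points, w, num_hashes=4, seed=0):
    """Hash each point with num_hashes random Gaussian projections;
    points that land in the same bucket on every projection collide."""
    rng = random.Random(seed)
    dim = len(points[0])
    dirs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
    offs = [rng.uniform(0.0, w) for _ in range(num_hashes)]
    buckets = {}
    for idx, p in enumerate(points):
        key = tuple(
            math.floor((sum(a * x for a, x in zip(d, p)) + b) / w)
            for d, b in zip(dirs, offs)
        )
        buckets.setdefault(key, []).append(idx)
    return buckets

def approx_neighbourhoods(points, eps):
    """Approximate the eps-neighbourhood query of DBSCAN: exact distances
    are only computed inside shared buckets, so true neighbours that
    straddle a bucket boundary may be missed (the approximation), but no
    point farther than eps is ever reported as a neighbour."""
    buckets = lsh_buckets(points, w=2.0 * eps)
    neigh = {i: set() for i in range(len(points))}
    for members in buckets.values():
        for i in members:
            for j in members:
                if i < j and math.dist(points[i], points[j]) <= eps:
                    neigh[i].add(j)
                    neigh[j].add(i)
    return neigh

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.1)]
neighbourhoods = approx_neighbourhoods(pts, eps=0.5)
```

A full implementation would use several independent hash tables to reduce the false negatives caused by boundary-straddling points; measure (2) would additionally replace dense buckets by a few representative points before the exact distance checks are run.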


Biographies

Li Tianrun

T. Li is currently pursuing a PhD in databases and data mining at the University of Wisconsin.

Heinis Thomas
t.heinis@imperial.ac.uk

T. Heinis is currently a lecturer at Imperial College London. His group's research focuses on data management in general, and on scientific databases, high-performance data analytics and approximate computing in particular.

Luk Wayne

W. Luk is a full-time professor at Imperial College London. He leads the Custom Computing Research Group at Imperial, which focuses on research into reconfigurable hardware. He is a fellow of the Royal Academy of Engineering, the IEEE and the British Computer Society. He is also a senior advisor to Maxeler and an Honorary Fellowship Advisor of the Croucher Foundation.



Copyright
© 2017 Vilnius University
Open access article under the CC BY license.

Keywords
analytics, clustering, high-dimensional data, approximate computing


INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952
  • Copyright © 2023 Vilnius University
