Pub. online: 1 Jan 2017 | Type: Research Article | Open Access
Journal: Informatica
Volume 28, Issue 1 (2017), pp. 105–130
Abstract
Analysing massive amounts of data and extracting value from them has become key across different disciplines. As the amounts of data grow rapidly, current approaches for data analysis are no longer efficient. This is particularly true for clustering algorithms, where distance calculations between pairs of points dominate the overall running time: the more data points there are in the dataset, the bigger the share of time needed for distance calculations.
Crucially, however, the data analysis and clustering process is rarely straightforward: instead, parameters need to be determined and tuned first. Entirely accurate results are thus rarely needed, and we can instead sacrifice a little precision in the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach based on approximating DBSCAN. More specifically, we propose two measures to reduce the distance calculation overhead and thereby approximate DBSCAN: (1) locality-sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations.
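To illustrate the first measure, the following is a minimal sketch of how locality-sensitive hashing can restrict an eps-neighbourhood query to points falling into the same hash bucket; it is not the ADvaNCE implementation, and the function names and parameters (n_projections, seed) are illustrative assumptions only.

```python
import numpy as np

def lsh_buckets(points, eps, n_projections=4, seed=0):
    # Project points onto a few random directions and quantise the
    # projections at width eps; points with identical quantised keys
    # fall into the same bucket (p-stable LSH for Euclidean distance).
    rng = np.random.default_rng(seed)
    projections = rng.normal(size=(n_projections, points.shape[1]))
    offsets = rng.uniform(0.0, eps, size=n_projections)
    keys = np.floor((points @ projections.T + offsets) / eps).astype(int)
    buckets = {}
    for idx, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(idx)
    return keys, buckets

def approximate_neighbours(i, points, eps, keys, buckets):
    # Approximate eps-neighbourhood of point i: exact distances are
    # computed only against points sharing its bucket, not all points.
    candidates = buckets[tuple(keys[i])]
    dists = np.linalg.norm(points[candidates] - points[i], axis=1)
    return [c for c, d in zip(candidates, dists) if d <= eps]

# Example: build the buckets once, then query neighbourhoods cheaply.
points = np.random.rand(10_000, 2)
keys, buckets = lsh_buckets(points, eps=0.05)
neighbourhood = approximate_neighbours(0, points, 0.05, keys, buckets)
```

The buckets are built once in roughly linear time, after which each neighbourhood query touches only a small candidate set instead of the full dataset; this is where the savings in distance calculations come from, at the cost of occasionally missing neighbours that hash into a different bucket.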
The experiments show that the resulting clustering algorithm is more scalable than the state of the art as the datasets grow. Compared with the most recent approximation technique for DBSCAN, our approach is in general one order of magnitude faster (at most 30× in our experiments) as the size of the datasets increases.
Journal: Informatica
Volume 4, Issues 3-4 (1993), pp. 360–383
Abstract
An analytical equation for the generalization error of the minimum empirical error classifier is derived for the case when the true classes are spherically Gaussian. It is compared with the generalization error of a mean squared error classifier, the standard Fisher linear discriminant function. In the case of spherically distributed classes the generalization error depends on the distance between the classes and on the number of training samples. It depends on the intrinsic dimensionality of the data only via the initialization of the weight vector. If the initialization is successful, the dimensionality does not affect the generalization error. It is concluded that the conditions favouring artificial neural nets are classifying patterns in a changing environment, when the intrinsic dimensionality of the data is low, or when the number of training sample vectors is very large.
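For context, a brief sketch of the setting assumed in the abstract (not the derived equation from the paper): for two spherically Gaussian classes with means mu_1, mu_2, common covariance sigma^2 I, and equal priors, the Fisher linear discriminant and the asymptotic (infinite-sample) error take the familiar forms below, where Phi denotes the standard normal cumulative distribution function.

```latex
% Setting: two classes N(\mu_1, \sigma^2 I) and N(\mu_2, \sigma^2 I), equal priors.
g(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0,
\qquad
\mathbf{w} \propto \mu_1 - \mu_2,
\qquad
w_0 = -\tfrac{1}{2}\,\mathbf{w}^{\top}(\mu_1 + \mu_2).
% With the normalized distance between the classes
% \delta = \lVert \mu_1 - \mu_2 \rVert / \sigma,
% the asymptotic misclassification probability is
P_{\infty} = \Phi\!\left(-\frac{\delta}{2}\right).
% The finite-sample generalization error analysed in the paper exceeds
% this value and approaches it as the number of training samples grows.
```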