Pub. online: 15 Jun 2023, Type: Research Article, Open Access
Journal:Informatica
Volume 34, Issue 2 (2023), pp. 285–315
Abstract
Over the past decades, many methods have been proposed to solve the linear or nonlinear mixing of spectra in hyperspectral data. Due to the relatively low spatial resolution of hyperspectral imaging, each image pixel may contain spectra from multiple materials. Hyperspectral unmixing is the task of finding these materials (endmembers) and their abundances. A few main approaches to hyperspectral unmixing have emerged, such as nonnegative matrix factorization (NMF), linear mixture modelling (LMM), and, most recently, autoencoder networks. These methods use different approaches to extract endmember and abundance information from hyperspectral images. However, due to the huge variation in the hyperspectral data being used, it is difficult to determine which methods perform well on which datasets and whether they can generalize to any input data when solving hyperspectral unmixing problems. To mitigate this problem, we propose a testing methodology for hyperspectral unmixing algorithms and create a standard benchmark for testing both existing and newly created algorithms. Several experiments were designed, and a variety of hyperspectral datasets were used in this benchmark to compare openly available algorithms and to determine the best-performing ones.
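As a rough illustration of the linear mixture model mentioned in the abstract above, the sketch below mixes two endmember spectra into one pixel and then recovers the abundances by least squares under a sum-to-one constraint. The function names, the two-endmember restriction, and the spectra are hypothetical and are not taken from the paper or its benchmark.

```python
# Linear mixture model (LMM) sketch for one pixel and two endmembers.
# All names and numbers are hypothetical illustrations, not data from the paper.

def mix(endmembers, abundances):
    """Forward LMM: pixel[b] = sum_k abundances[k] * endmembers[k][b]."""
    n_bands = len(endmembers[0])
    return [
        sum(a * e[b] for a, e in zip(abundances, endmembers))
        for b in range(n_bands)
    ]


def unmix_two(pixel, e1, e2):
    """Least-squares abundances for pixel ~ a*e1 + (1 - a)*e2, with a clipped to [0, 1]."""
    diff = [u - v for u, v in zip(e1, e2)]
    den = sum(d * d for d in diff)
    if den == 0:
        raise ValueError("endmembers must differ")
    num = sum((p - v) * d for p, v, d in zip(pixel, e2, diff))
    a = min(1.0, max(0.0, num / den))  # enforce non-negativity and sum-to-one
    return a, 1.0 - a


# Hypothetical reflectance spectra over four bands.
soil = [0.30, 0.35, 0.40, 0.45]
vegetation = [0.05, 0.08, 0.50, 0.60]

pixel = mix([soil, vegetation], [0.7, 0.3])
a_soil, a_veg = unmix_two(pixel, soil, vegetation)
print(round(a_soil, 6), round(a_veg, 6))
```

Real unmixing algorithms must also estimate the endmembers themselves and handle more than two materials, which is where NMF and autoencoder approaches differ.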
Journal:Informatica
Volume 32, Issue 3 (2021), pp. 441–475
Abstract
This paper is devoted to the problem of class imbalance in machine learning, focusing on the detection of rare intrusion classes in computer networks. Class imbalance occurs when the examples of one class heavily outnumber those of the other classes. In this paper, we are particularly interested in classifiers, as pattern recognition and anomaly detection can be solved as a classification problem. Since the major part of any organization's network traffic is still benign and malignant traffic is rare, researchers have to deal with a class imbalance problem. Substantial research has been undertaken to identify the methods or data features that allow these attacks to be identified accurately. The usual tactic for dealing with the class imbalance problem, however, is to label all malignant traffic as one class and then solve a binary classification problem. In this paper, we choose not to group or drop rare classes but instead investigate what can be done to achieve good multi-class classification efficiency. Rare class records were up-sampled to preset target ratios using the SMOTE method (Chawla et al., 2002). Experiments with three network traffic datasets, namely CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin et al., 2018) and LITNET-2020 (Damasevicius et al., 2020), were performed with the aim of achieving reliable recognition of the rare malignant classes available in these datasets.
Popular machine learning algorithms were chosen to compare their readiness to support rare class detection. The relevant algorithm hyperparameters were tuned within a wide range of values, different feature selection methods were used, and tests were executed with and without over-sampling to assess multi-class classification performance on rare classes.
A ranking of the machine learning algorithms based on Precision, Balanced Accuracy Score, $\bar{G}$, and the Bias and Variance decomposition of the prediction error shows that decision tree ensembles (Adaboost, Random Forest Trees and Gradient Boosting Classifier) performed best on the network intrusion datasets used in this research.
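To show why imbalance-aware scores matter for rare classes, the sketch below computes plain accuracy, balanced accuracy (the mean of per-class recall), and a geometric mean of per-class recall on a toy prediction. It assumes that $\bar{G}$ denotes the geometric mean of per-class recall, which the abstract does not define; the labels and predictions are invented for illustration.

```python
from collections import defaultdict
from math import prod


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted examples / all examples of that class."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    return {c: hit[c] / total[c] for c in total}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed meaning of G-bar)."""
    recalls = per_class_recall(y_true, y_pred)
    return prod(recalls.values()) ** (1.0 / len(recalls))


# Hypothetical traffic labels: one dominant benign class and two rare attack classes.
y_true = ["benign"] * 8 + ["dos"] * 2 + ["bot"] * 2
y_pred = ["benign"] * 8 + ["dos", "benign"] + ["benign", "benign"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.75
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(g_mean(y_true, y_pred))             # 0.0
```

Plain accuracy stays high even though one rare class is never detected, while the geometric mean drops to zero as soon as any class has zero recall.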
Pub. online: 23 Mar 2020, Type: Research Article, Open Access
Journal:Informatica
Volume 31, Issue 1 (2020), pp. 143–160
Abstract
Phishing activities remain a persistent security threat, with global losses exceeding 2.7 billion USD in 2018, according to the FBI’s Internet Crime Complaint Center. In the literature, several generations of phishing website detection methods can be observed. The oldest methods rely on manual blacklisting of known phishing websites’ URLs in a centralized database, but they are unable to detect newly launched phishing websites. More recent studies have attempted to solve phishing website detection as a supervised machine learning problem on phishing datasets built from features extracted from phishing websites’ URLs. These studies have shown that some classification algorithms perform better than others on differently designed datasets, but they have not identified the best classification algorithm for the phishing website detection problem in general. The purpose of this research is to compare classic supervised machine learning algorithms on all publicly available phishing datasets with predefined features and to identify the best-performing algorithm for phishing website detection, regardless of a specific dataset design. Eight widely used classification algorithms were configured in Python using the scikit-learn library and tested for classification accuracy on all publicly available phishing datasets. The classification algorithms were then ranked by accuracy on the different datasets using three ranking techniques, and the results were tested for statistically significant differences using Welch’s t-test. The comparison results are presented in this paper, showing that ensembles and neural networks outperform other classical algorithms.
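For readers unfamiliar with Welch's t-test, the following minimal sketch computes its test statistic and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances. The accuracy values are made up for illustration and are not results from the paper; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` also returns the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) of Welch's unequal-variance t-test."""
    va = variance(a) / len(a)  # squared standard error of mean(a), sample variance
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical cross-validation accuracies of two classifiers on one dataset.
acc_ensemble = [0.970, 0.965, 0.972, 0.968, 0.971]
acc_baseline = [0.930, 0.945, 0.925, 0.950, 0.935]

t_stat, dof = welch_t(acc_ensemble, acc_baseline)
print(t_stat, dof)
```

Unlike Student's t-test, no pooled variance is used, so the test remains valid when the two algorithms have very different accuracy spreads across folds.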
Journal:Informatica
Volume 23, Issue 3 (2012), pp. 335–355
Abstract
Glaucoma is one of the most insidious eye diseases: its onset and progression go unnoticed by the patient. This article provides a brief overview of methods and algorithms for parameterizing the optic nerve. Parameterization itself is an important task that uniquely defines the structure of the optic nerve disc and can further be used in disease detection or in other studies that require a parametric estimate of the eye fundus pattern. So far, completely automated planimetric parameterization of the excavation from eye fundus images has not been investigated in detail in the scientific literature. In this article, the authors describe an automated excavation and parameterization algorithm and perform a correlation analysis of the parameters obtained by both automated and interactive techniques. The obtained results are then compared with those produced by Optical Coherence and Heidelberg Retina Tomography. Finally, the article discusses the ability to detect glaucoma using the estimated parameters of the eye fundus structures obtained by the different parameterization techniques.
Journal:Informatica
Volume 22, Issue 4 (2011), pp. 507–520
Abstract
The most classical visualization methods, including multidimensional scaling and its particular case, Sammon's mapping, encounter difficulties when analyzing large data sets. One possible way to solve the problem is to apply artificial neural networks. This paper presents the visualization of large data sets using the feed-forward neural network SAMANN. Its back-propagation-like learning rule has been developed to allow a feed-forward artificial neural network to learn Sammon's mapping in an unsupervised way. In its initial form, SAMANN training is computationally expensive. In this paper, we identify conditions that reduce the computational expense of visualization, even for large data sets. We show that the original dimensionality of the data can be reduced to a lower one using a small number of iterations. The visualization results for real-world data sets are presented.
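The quantity that Sammon's mapping minimizes, commonly called Sammon's stress, can be sketched as follows. This shows only the error measure in its standard textbook form, not the SAMANN network or its learning rule; the function name and example points are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def sammon_stress(high, low):
    """Sammon's stress between original points `high` and their projections `low`.

    E = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij over all pairs i < j,
    where d* are distances in the original space and d in the projected space.
    """
    pairs = list(combinations(range(len(high)), 2))
    d_star = [dist(high[i], high[j]) for i, j in pairs]
    d_proj = [dist(low[i], low[j]) for i, j in pairs]
    if any(ds == 0 for ds in d_star):
        raise ValueError("duplicate points in the original space are not supported")
    weighted = sum((ds - dp) ** 2 / ds for ds, dp in zip(d_star, d_proj))
    return weighted / sum(d_star)


# Two 3-D points at distance 5, projected onto a line at distance 4.
high = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
low = [(0.0,), (4.0,)]
print(sammon_stress(high, low))  # (5 - 4)^2 / 5 / 5 = 0.04
```

Because each squared error is divided by the original distance, small distances are weighted more heavily, which is why the mapping tends to preserve local structure.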
Journal:Informatica
Volume 18, Issue 2 (2007), pp. 187–202
Abstract
In this paper, the relative multidimensional scaling method is investigated. This method is designed to visualize large multidimensional data. It applies multidimensional scaling (MDS) to the so-called basis vector set and then maps the remaining vectors of the analyzed data set. In the original relative MDS algorithm, the visualization process is divided into three steps: the set of basis vectors is constructed using the k-means clustering method; this set is projected onto the plane using the MDS algorithm; and the remaining data are visualized using the relative mapping algorithm. We propose a modification that differs from the original algorithm in the strategy for selecting the basis vectors. The experimental investigation has shown that the modification outperforms the original algorithm in both visualization quality and computational cost. The conditions under which relative MDS is more efficient than standard MDS are estimated.
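The third step, relative mapping, can be illustrated with a simplified sketch: the basis vectors already have fixed positions on the plane, and each remaining point is placed so that its planar distances to the basis points match its original-space distances as closely as possible. The gradient-descent scheme, step size, and names below are assumptions chosen for illustration, not the algorithm or settings used in the paper.

```python
from math import dist


def place_relative(basis_2d, target_dists, steps=500, lr=0.5):
    """Place one new point on the plane relative to fixed basis points.

    Minimizes sum_i (target_dists[i] - ||y - basis_2d[i]||)^2 by plain
    gradient descent, starting from the centroid of the basis points.
    """
    n = len(basis_2d)
    y = [sum(b[0] for b in basis_2d) / n, sum(b[1] for b in basis_2d) / n]
    for _ in range(steps):
        gx = gy = 0.0
        for (bx, by), d_star in zip(basis_2d, target_dists):
            r = dist(y, (bx, by))
            if r == 0.0:
                continue  # gradient undefined when y coincides with a basis point
            coef = -2.0 * (d_star - r) / r
            gx += coef * (y[0] - bx)
            gy += coef * (y[1] - by)
        # Scale the step by the number of basis points to keep it stable.
        y[0] -= lr / n * gx
        y[1] -= lr / n * gy
    return (y[0], y[1])


# Hypothetical planar positions of four basis vectors (e.g. produced by MDS).
basis_2d = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
# Original-space distances from a new data point to each basis vector;
# here they are generated from a known planar location for checking.
true_point = (1.0, 2.0)
target_dists = [dist(true_point, b) for b in basis_2d]

placed = place_relative(basis_2d, target_dists)
print(placed)
```

Because the basis points stay fixed, each remaining point costs time proportional to the number of basis vectors rather than to the size of the whole data set, which is what makes the relative approach attractive for large data.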