Main Article Content
Introduction. The task of clustering is to divide the set of objects under study into groups of "similar" objects, called clusters. The purpose of cluster analysis is to find the structures that exist in the data, which is expressed in the formation of groups of similar objects. At the same time, the effect of cluster analysis is to structure the objects under study. This means that clustering techniques are needed to identify a structure in the data that is not easy to find by visual inspection or with the help of experts. However, cluster analysis methods require some a priori information. In particular, before performing a cluster analysis, most methods already need to know the number of clusters. Such information is not always possible due to, for example, the large dimension of the data, so researchers in many cases choose the number of clusters intuitively, which, of course, does not always lead to significant results. Therefore, before clustering, it seems possible to use any other methods that would allow to estimate at least approximately the number of possible clusters.
Purpose. The aim of the study is to determine the possibility of using a specific type of artificial neural networks - a self-organized map of features - to pre-determine the number of clusters for a particular method of clustering - the method of k-means. The tasks set in the study are: reviewing of the k-means clustering algorithm and its implementation; reviewing of a self-organized map of signs and its implementation; to analyze the possibilities of using a self-organized feature map to obtain a priori information on the number of clusters for the k-means algorithm and develop recommendations for the use of self-organized feature maps in data clustering.
Results. One of the simplest clustering algorithms is k-means, which is based on an iterative process of stabilizing cluster centroids. The main characteristic of the cluster is its centroid and all the work of the algorithm is aimed at stabilization or - at best - complete freezing of changes in the centroid of the cluster. The k-means algorithm builds k clusters placed at as large distances from each other as possible. The main type of problem solved by the k-means algorithm is the presence of hypotheses about the existence of a certain number of clusters in the data set, and they should be as different as possible. The choice of the number k can be based on the results of previous research, theoretical considerations or intuition. When specifying the wrong number of clusters, the algorithm generates the wrong result, to avoid which we propose to use one of the methods of preliminary estimation of the possible number of clusters, using the results of neural network with learning without a teacher, which is a self-organizing map - Kohonen network.
The self-organized map of features has a set of input elements, the number of which corresponds to the dimension of the training points, and a set of output elements that act as prototypes. Cluster elements are placed in the form of a one- or two-dimensional array. During the training, all elements can be considered as candidates for the award in the form of training points. For the winning element, the weight values are adjusted so that this cluster element is located even closer to the training point. The weight values of an element are updated if it is inside a circle of a given radius centered in the winning element. During training, the radius usually gradually decreases. In the first stage, the elements are arranged so as to reflect the space of the input elements, and in the second stage, their positions are refined. Typically, the process is represented visually by using two-dimensional data and constructing an appropriate surface.
During our work we made a series of experiments to combine the two above algorithms, using the results of the SOFM network as a priori data for the k-means algorithm. The results of the experiments show the possibility of using the SOFM network to estimate the number of cluster elements. However, the estimate itself is currently considered more complex than simply determining the number of SOFM network outputs with high values of the number of input set points corresponding to them.
Conclusion. As a result of the work we have hext conclusions. 1. The algorithm of k-means is reviewed, the results of work of algorithm implementation in Python language are described, and shortcomings of algorithm are defined. 2. The algorithm of SOFM network learning is reviewed and implemented in Python language. 3. The hypothesis about the possibility of using the SOFM network for preliminary estimation of the number of clusters in a given set of points is formulated and tested. An algorithm is formed and further directions of research are determined.
Hartigan, J.A. (1975) Clustering Algorithms. John Wiley & Sons, New York, 351 p.
Kohonen, T. (1990) The Self-Organizing Map. Proceedings of the IEEE, 78, 1464-1480. http://dx.doi.org/10.1109/5.58325
Осовский С. Нейронные сети для обработки информации / Пер. с польского И.Д. Рудинского. – М.: Финансы и статистика, 2002. – 344 с.
Рашид Т. Создаем нейронную сеть.: Пер. с англ. – СПб.: ООО “Альфа-книга”, 2017. – 272 с.
Хайкин С. Нейронные сети: полный курс, 2-е издание.: Пер. с англ. – М.: Издательский дом «Вильямс», 2006. – 1104 с.