
A nonparametric algorithm for automatic classification of large multivariate statistical data sets and its application
I.V. Zenkov¹,⁵, A.V. Lapko²,⁴, V.A. Lapko²,⁴, S.T. Im¹,³,⁴, V.P. Tuboltsev⁴, V.L. Avdeenok⁴

¹Siberian Federal University,
660041, Krasnoyarsk, Russia, Svobodny Av. 79,
²Institute of Computational Modelling SB RAS,
660036, Krasnoyarsk, Russia, Akademgorodok 50,
³Sukachev Institute of Forest SB RAS,
660036, Krasnoyarsk, Russia, Akademgorodok 50,
⁴Reshetnev Siberian State University of Science and Technology,
660037, Krasnoyarsk, Russia, Krasnoyarsky Rabochy Av. 31
⁵Krasnoyarsk Branch of the Federal Research Center for Information and Computational Technologies,
660049, Krasnoyarsk, Russia, Mira Av. 53


DOI: 10.18287/2412-6179-CO-801

Pages: 253-260.

Full text of article: Russian language.

Abstract:
A nonparametric algorithm for automatic classification of large statistical data sets is proposed. The algorithm is based on a procedure for optimal discretization of the range of values of a random variable. A class is defined as a compact group of observations of a random variable that corresponds to a unimodal fragment of the probability density. The algorithm relies on "compressing" the initial information by decomposing the multidimensional space of attributes. As a result, a large statistical sample is transformed into a data array composed of the centers of multidimensional sampling intervals and the corresponding frequencies with which observations fall into them. To substantiate the optimal discretization procedure, we use the results of a study of the asymptotic properties of a kernel-type regression estimate of the probability density. The optimal number of sampling intervals for the range of values of one- and two-dimensional random variables is determined from the condition of the minimum root-mean-square deviation of the regression probability density estimate. The results obtained are generalized to the discretization of the range of values of a multidimensional random variable. The optimal discretization formula contains a component that is characterized by a nonlinear functional of the probability density. An analytical dependence of this component on the antikurtosis coefficient of a one-dimensional random variable is established. For independent components of a multidimensional random variable, a methodology is developed for estimating the optimal number of sampling intervals and their lengths. On this basis, a nonparametric algorithm for automatic classification is developed. It uses a sequential procedure that checks the proximity of the centers of multidimensional sampling intervals and the relationships between the frequencies with which observations from the original sample fall into these intervals. To further increase the computational efficiency of the proposed algorithm, a multithreaded software implementation is used. The practical significance of the developed algorithms is confirmed by the results of their application to processing remote sensing data.
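
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The number of sampling intervals `bins` is supplied by the caller, because the optimal discretization formula is not given on this page. The grouping rule (each occupied cell is attached to the adjacent cell with the highest frequency until a local maximum is reached) is an assumed stand-in for the sequential proximity and frequency check. The function names `compress` and `classify_cells` are hypothetical.

```python
import numpy as np
from itertools import product


def compress(sample, bins):
    """Replace an (n x k) sample by occupied grid cells, their centers and frequencies.

    `bins` (int or one int per attribute) is an assumed input; the paper's
    optimal number of sampling intervals is not reproduced here.
    """
    sample = np.asarray(sample, dtype=float)
    bins = np.asarray(bins, dtype=int)
    lo = sample.min(axis=0)
    hi = sample.max(axis=0)
    width = (hi - lo) / bins
    width[width == 0] = 1.0  # guard against a constant attribute
    idx = np.floor((sample - lo) / width).astype(int)
    idx = np.minimum(idx, bins - 1)  # the maximum value belongs to the last interval
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    centers = lo + (cells + 0.5) * width
    return cells, centers, counts


def classify_cells(cells, counts):
    """Label occupied cells so that each label covers one unimodal group.

    Assumed rule (not from the source): move from a cell to its adjacent
    cell with the strictly highest frequency until no neighbor is higher.
    """
    freq = {tuple(map(int, c)): int(n) for c, n in zip(cells, counts)}
    k = cells.shape[1]
    offsets = [o for o in product((-1, 0, 1), repeat=k) if any(o)]

    def uphill(cell):
        best = cell
        for o in offsets:
            nb = tuple(c + d for c, d in zip(cell, o))
            if freq.get(nb, 0) > freq[best]:
                best = nb
        return best

    modes = {}   # local maximum -> class label
    labels = {}  # cell -> class label
    for cell in freq:
        current = cell
        while True:
            nxt = uphill(current)
            if nxt == current:
                break
            current = nxt
        labels[cell] = modes.setdefault(current, len(modes))
    return np.array([labels[tuple(map(int, c))] for c in cells])
```

Because the classification step works on occupied cells and their frequencies only, its cost depends on the number of occupied cells rather than on the size of the original sample, which is the point of the compression stage.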

Keywords:
automatic classification algorithm, multidimensional histogram, regression probability density estimate, discretization of the range of values of a random variable, large samples, antikurtosis coefficient, remote sensing data.

Citation:
Zenkov IV, Lapko AV, Lapko VA, Im ST, Tuboltsev VP, Avdeenok VL. A nonparametric algorithm for automatic classification of large multivariate statistical data sets and its application. Computer Optics 2021; 45(2): 253-260. DOI: 10.18287/2412-6179-CO-801.

Acknowledgements:
The research was funded by RFBR, Krasnoyarsk Territory and Krasnoyarsk Regional Fund of Science, project number 20-41-240001.


