USING BOOSTING METHODS FOR MACHINE LEARNING PROBLEMS
Abstract
Introduction. Among modern machine learning methods, some of the most effective are ensembles of algorithms, and boosting is one of the most frequently used ensemble techniques. Boosting is a method of building an ensemble in which the base algorithms are trained sequentially, with each subsequent algorithm in the ensemble applied to the results of the previous one. Boosting on decision trees remains one of the most effective and popular machine learning methods. The article describes the idea behind the gradient and adaptive boosting methods and the XGBoost and CatBoost libraries, and considers their practical application to machine learning problems. The binary classification examples are the problem of predicting Parkinson's disease in a patient and the problem of credit scoring; the regression example is the problem of predicting the number of bicycles rented per day.
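To make the idea concrete, the following Python sketch (illustrative only, not the authors' code) shows the core of gradient boosting for regression with squared loss: each new decision tree is fitted to the residual errors of the ensemble built so far. The function names and hyperparameter values are chosen for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Start from a constant model: the mean of the targets.
    base_value = y.mean()
    prediction = np.full(len(y), base_value)
    trees = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared loss.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)  # the next base algorithm learns the current errors
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base_value, trees

def gradient_boost_predict(X, base_value, trees, learning_rate=0.1):
    # Sum the scaled contributions of all base trees.
    prediction = np.full(X.shape[0], base_value)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

Libraries such as XGBoost and CatBoost implement the same sequential scheme with additional regularization and extensive hardware and software optimizations.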
Purpose. The purpose of this paper is to apply different boosting methods to machine learning problems and to compare them.
Results. Gradient boosting, AdaBoost, XGBoost and CatBoost were used to solve classification and regression problems in the Jupyter Notebook environment. The quality of the forecasts and the training time of the algorithms were compared across these machine learning problems. To assess quality in the classification problems, the accuracy score on the test dataset was calculated; in the regression problem, the coefficient of determination was used.
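A minimal sketch of such a comparison is given below (it is not the authors' notebook): a synthetic dataset from make_classification stands in for the UCI datasets used in the paper, and scikit-learn, xgboost and catboost are assumed to be installed.

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Placeholder data; in the paper the UCI Parkinsons and Credit Approval datasets are used.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=42),
    "CatBoost": CatBoostClassifier(n_estimators=100, verbose=0, random_state=42),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)                           # training time is measured here
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))   # accuracy on the test set
    print(f"{name}: accuracy = {acc:.3f}, training time = {elapsed:.2f} s")

For the regression problem the same loop applies with the regressor counterparts (GradientBoostingRegressor, AdaBoostRegressor, XGBRegressor, CatBoostRegressor) and sklearn.metrics.r2_score as the coefficient of determination.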
The results of the computational experiments show that, in general, boosting models give quite good results for both classification and regression problems. For the binary classification problems, the gradient boosting models show the best results. The XGBoost model demonstrates the shortest training time together with the best results, thanks to the hardware and software optimizations in the implementation of this library. The CatBoost model showed the best results for the regression problem, but its training time is consistently an order of magnitude longer than that of the other models.
Conclusion. This article describes the idea behind the gradient and adaptive boosting methods and the XGBoost and CatBoost libraries. The quality of the forecasts and the training time of the algorithms were compared for the medical diagnostics problem, the credit scoring problem, and the problem of predicting the number of bicycle rentals.
The XGBoost model was found to have the shortest training time and the best results for the binary classification problems. For the regression problem, the CatBoost model showed the best results, but its training time is consistently an order of magnitude longer than that of the other models.
References
Kashnitsky Yu. S., Ignatov D. I. An ensemble machine learning method based on classifier recommendation // Intelligent Systems. Theory and Applications, 2015. Vol. 19, No. 4. P. 37–55. (In Russian)
Kryvokhata A. G., Kudin O. V., Davydovskyi M. V., Lisnyak A. O. Application of ensemble learning to acoustic data classification problems // Visnyk of Zaporizhzhia National University. Physical and Mathematical Sciences, 2018. No. 1. P. 48–60. (In Ukrainian)
Boosting - Loginom Wiki [Online]. Available: https://wiki.loginom.ru/articles/boosting.html. (In Russian)
Ensemble Methods in Machine Learning: Bagging Versus Boosting [Online]. Available: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting.
The AdaBoost algorithm [Online]. Available: http://www.machinelearning.ru/wiki/index.php?title=AdaBoost. (In Russian)
Pivkin K. S. Modeling consumer demand at retail trade enterprises based on machine learning methods: Cand. Sci. (Economics) dissertation: 08.00.13. Izhevsk, 2018. 220 p. (In Russian)
The XGBoost algorithm: may it reign long [Online]. Available: https://cutt.ly/9nATUGb. (In Russian)
Fast gradient boosting with CatBoost / OTUS company blog / Habr [Online]. Available: https://habr.com/ru/company/otus/blog/527554. (In Russian)
UCI Machine Learning Repository: Parkinsons Data Set [Online]. Available: https://archive.ics.uci.edu/ml/datasets/parkinsons.
UCI Machine Learning Repository: Credit Approval Data Set [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Credit+Approval.
UCI Machine Learning Repository: Bike Sharing Dataset Data Set [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset.