COMPARATIVE ANALYSIS OF AUTOMATIC TEXT CLASSIFICATION METHODS IN BIBLIOGRAPHIC INFORMATION SYSTEMS

Nataliya KRASNOSHLYK
Pavlo BOHATYRENKO

Abstract

Introduction. The rapid growth of digital technologies and the increasing intensity
of scientific communication result in a constant expansion of textual data volumes accumulated in
bibliographic information systems (BIS), electronic repositories, and scientometric databases.
Traditional cataloguing approaches based on manual or semi-automatic indexing no longer meet the
scalability and timeliness requirements of modern information environments. An experienced
cataloguer can process only 20–30 documents per day, which is wholly insufficient given the
continuous and rapid growth of scientific output. Research shows that accurate thematic classification
can improve the relevance of search results by 30–40% compared to systems relying solely on full-text
search. Despite a substantial body of work on text classification algorithms, systematic comparative
studies specifically addressing the thematic categorisation of bibliographic records — with
consideration of performance and resource requirements — remain relatively scarce in the literature.
Purpose. This article presents a comparative analysis and experimental evaluation of
automatic text classification methods, from classical machine learning algorithms to transformer-based
neural network architectures, with respect to their effectiveness, performance, and practical
suitability for use in bibliographic information systems.
Results. A modular classification system for bibliographic records was developed in Python
using scikit-learn, PyTorch, and FastAPI. Six models were evaluated on classification quality
(Accuracy, Precision, Recall, F1-Score), processing speed, and resource consumption. Among
classical machine learning methods, Support Vector Machines (SVM) achieved the highest quality
(F1-Score = 0.885), while Naive Bayes demonstrated the fastest processing speed (487 samples/sec).
Among deep learning methods, BERT achieved the best classification quality (F1-Score = 0.912) but
required substantial computational resources (18.5 hours of training, a GPU, and 8 GB of RAM). The CNN
provided a good balance between accuracy (F1-Score = 0.876) and training time (3.2 hours). It was
established that combining title, abstract, and keywords yields 8–12% better results than using any
single field. For the morphologically rich Ukrainian language, lemmatisation using pymorphy2
provides a 2–5% accuracy improvement over stemming.
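The field-combination finding and the comparison of classical models can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the toy records, the `combine_fields` and `evaluate` helpers, the TF-IDF features, the linear SVM variant, and the macro averaging are all assumptions made for illustration, and the printed scores say nothing about the results reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def combine_fields(record):
    """Concatenate title, abstract, and keywords into a single text."""
    parts = [record.get("title", ""), record.get("abstract", "")]
    parts.extend(record.get("keywords", []))
    return " ".join(part for part in parts if part)


# Toy records invented for illustration; they are not the study's data.
TRAIN = [
    ({"title": "Convolutional networks for image recognition",
      "abstract": "A deep learning model classifies images.",
      "keywords": ["neural networks", "deep learning"]}, "computer science"),
    ({"title": "Transformer models for text classification",
      "abstract": "Language models are fine-tuned on labelled text.",
      "keywords": ["machine learning", "language models"]}, "computer science"),
    ({"title": "Gene expression in plant cells",
      "abstract": "Proteins and genes are measured in leaf tissue.",
      "keywords": ["genes", "proteins"]}, "biology"),
    ({"title": "Bacterial growth under antibiotic stress",
      "abstract": "Cell cultures and genes respond to antibiotics.",
      "keywords": ["bacteria", "cells"]}, "biology"),
]
TEST = [
    ({"title": "Deep learning for text",
      "abstract": "Neural language models.",
      "keywords": ["machine learning"]}, "computer science"),
    ({"title": "Genes and proteins in bacteria",
      "abstract": "Cell tissue study.",
      "keywords": ["cells"]}, "biology"),
]


def evaluate(classifier):
    """Train a TF-IDF pipeline and return the macro-averaged F1 on TEST."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit([combine_fields(r) for r, _ in TRAIN], [y for _, y in TRAIN])
    predicted = model.predict([combine_fields(r) for r, _ in TEST])
    return f1_score([y for _, y in TEST], predicted,
                    average="macro", zero_division=0)


# LinearSVC stands in for "SVM"; the kernel used in the study is not stated.
scores = {
    "Naive Bayes": evaluate(MultinomialNB()),
    "SVM (linear)": evaluate(LinearSVC()),
}
for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

To compare a single field against the combined input, `combine_fields` could be replaced with a function that returns only `record["title"]`, keeping the rest of the pipeline unchanged.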
Conclusion. BERT achieves the highest classification quality (F1-Score = 0.912) but demands
significant resources. SVM provides the best balance for real-time production systems (F1-Score =
0.885, 199 samples/sec, 180 MB RAM). Naive Bayes is optimal for resource-constrained environments
and rapid prototyping. Practical recommendations for choosing a classification method according to
specific requirements for accuracy, speed, resource consumption, and interpretability are formulated.
Multilingual transformer models (XLM-RoBERTa) are recommended for international systems
supporting multiple languages due to their cross-lingual transfer capabilities.
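The recommendations above can be summarised as a small decision helper. This is only a sketch: the function `recommend_method`, its parameter names, and the order in which the conditions are checked are assumptions made here for illustration, not a procedure given by the authors.

```python
def recommend_method(multilingual=False, low_resources=False,
                     real_time=False, gpu_available=False):
    """Return a method name following the conclusions of the abstract.

    The priority of the checks is an assumption made for this sketch.
    """
    if multilingual:
        # Cross-lingual transfer is the stated reason for this choice.
        return "XLM-RoBERTa"
    if low_resources:
        # Resource-constrained environments and rapid prototyping.
        return "Naive Bayes"
    if real_time:
        # Best balance of quality, speed, and memory for production use.
        return "SVM"
    if gpu_available:
        # Highest reported quality, at the cost of long training on a GPU.
        return "BERT"
    # With no constraint stated, fall back to the balanced option.
    return "SVM"


print(recommend_method(real_time=True))
print(recommend_method(gpu_available=True))
```

A real selection would also weigh interpretability and the exact accuracy target, which this helper deliberately leaves out.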

Article Details

How to Cite
KRASNOSHLYK, N., & BOHATYRENKO, P. (2025). COMPARATIVE ANALYSIS OF AUTOMATIC TEXT CLASSIFICATION METHODS IN BIBLIOGRAPHIC INFORMATION SYSTEMS. Cherkasy University Bulletin: Applied Mathematics. Informatics, (1). https://doi.org/10.31651/2076-5886-2025-1-72-85
Section
Informatics
Author Biographies

Nataliya KRASNOSHLYK, Bohdan Khmelnytsky National University of Cherkasy

Candidate of Technical Sciences, Associate Professor, Department of Applied Mathematics and
Informatics, The Bohdan Khmelnytsky National University of Cherkasy, Ukraine

Pavlo BOHATYRENKO, Bohdan Khmelnytsky National University of Cherkasy

Student, Department of Applied Mathematics and Informatics, The Bohdan Khmelnytsky National
University of Cherkasy, Ukraine
