DESIGN AND IMPLEMENTATION OF AN INTEGRATED WEB CONTENT MANAGEMENT AND AGGREGATION SYSTEM

Maksym PEDCHENKO; Oleksandr SERDIUK

doi:10.31651/2076-5886-2025-1-95-106


Published: Dec 29, 2025


Keywords:

browser extension, web content extraction, knowledge management, template system, Markdown, semantic analysis.

Maksym PEDCHENKO

Bohdan Khmelnytsky National University of Cherkasy

Oleksandr SERDIUK

https://orcid.org/0000-0002-3919-4661

Abstract

Introduction. The modern web is an overwhelming source of information for
research, professional and learning activities. Traditional approaches to content preservation –
bookmarks, screenshots, manual copying – have significant drawbacks. Bookmarks store only URLs
that become useless when pages move or disappear. Screenshots sacrifice searchability. Manual
copying is tedious and strips away metadata. The result is fragmented information scattered across
multiple tools, difficult to search and disconnected from its original context. Browser extensions
occupy a unique position in this workflow, operating at the moment when users encounter valuable
content.
Purpose. The primary objective is to design and implement a browser extension for structured
web content capture and management, and to analyse the architectural solutions that ensure effective
extraction, processing and storage of information from web pages.
Results. The Digital Hub browser extension implements a multi-stage content extraction
pipeline that performs semantic analysis of web pages, removal of irrelevant elements and conversion
to Markdown format. The pipeline prioritises semantic CSS selectors for content detection over
scoring-based approaches, providing faster operation on modern HTML5-compliant sites compared
to solutions like Mozilla Readability. A template system with five variable types (common, meta,
schema, selector and LLM prompt), filter chains for value transformation, and trigger patterns for
automatic template selection enables user-defined extraction rules for different content types. The
architecture of the browser extension with five isolated execution contexts and a typed messaging
protocol ensures type safety at compile time. A comparative analysis of existing approaches
demonstrates that no current solution combines automatic structured extraction with a flexible
template system and built-in search.
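The template mechanism described above (variables, filter chains and trigger patterns) can be illustrated with a minimal TypeScript sketch. It is not the Digital Hub implementation: the `{{name|filter}}` placeholder syntax, the filter names, the flat variable bag (in place of the five variable types) and the functions `resolve`, `render` and `selectTemplate` are all assumptions made for illustration.

```typescript
// Hypothetical sketch: every name and the {{name|filter}} syntax are assumptions,
// not the actual Digital Hub API.

type Filter = (value: string) => string;

// Illustrative filter registry; a real system would likely offer many more.
const filters: Record<string, Filter> = {
  trim: (v) => v.trim(),
  lower: (v) => v.toLowerCase(),
  slug: (v) =>
    v.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, ""),
};

// Resolve an expression such as "title|trim|slug" against a bag of variables,
// applying the filters left to right.
function resolve(expression: string, variables: Record<string, string>): string {
  const parts = expression.split("|").map((part) => part.trim());
  const name = parts[0] ?? "";
  let value = variables[name] ?? "";
  for (const filterName of parts.slice(1)) {
    const filter = filters[filterName];
    if (!filter) throw new Error(`Unknown filter: ${filterName}`);
    value = filter(value);
  }
  return value;
}

// Replace every {{expression}} placeholder in a template body.
function render(body: string, variables: Record<string, string>): string {
  return body.replace(/\{\{([^}]+)\}\}/g, (_match, expr: string) =>
    resolve(expr, variables),
  );
}

interface Template {
  name: string;
  trigger: RegExp; // URL pattern that selects this template automatically
  body: string;
}

// Pick the first template whose trigger pattern matches the page URL.
function selectTemplate(url: string, templates: Template[]): Template | undefined {
  return templates.find((t) => t.trigger.test(url));
}

// Example templates; the URLs are placeholders.
const exampleTemplates: Template[] = [
  { name: "video", trigger: /^https:\/\/video\.example\.com\//, body: "# {{title|trim}}\n" },
  { name: "article", trigger: /^https?:\/\//, body: "# {{title|trim}}\nfile: {{title|slug}}.md\n" },
];
```

In this sketch the templates are checked in order, so more specific trigger patterns should come before general ones. A user-defined rule set then amounts to an ordered list of templates plus the filters they reference.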
Conclusion. The combination of automated extraction with user-defined processing rules offers
an effective compromise between fully manual capture and fully automatic systems. The system
provides a functional foundation for personal knowledge management workflows, transforming
ephemeral web pages into structured, searchable notes. Further development directions include
expanding output format support and improving the content scoring algorithm for non-standard page
layouts.

How to Cite

PEDCHENKO, M., & SERDIUK, O. (2025). DESIGN AND IMPLEMENTATION OF AN INTEGRATED WEB CONTENT MANAGEMENT AND AGGREGATION SYSTEM. Cherkasy University Bulletin: Applied Mathematics. Informatics, (1). https://doi.org/10.31651/2076-5886-2025-1-95-106

Issue

No. 1 (2025)

Section

Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.

Author Biographies

Maksym PEDCHENKO, Bohdan Khmelnytsky National University of Cherkasy

Student, Department of Applied Mathematics and Informatics, The Bohdan Khmelnytsky National
University of Cherkasy, Ukraine

Oleksandr SERDIUK

Candidate of Economic Sciences, Associate Professor, Department of Informatics and Applied
Mathematics, The Bohdan Khmelnytsky National University of Cherkasy, Ukraine

References

React Documentation [Electronic resource] // React. – Access mode: https://react.dev. – Title from screen.

TypeScript Documentation [Electronic resource] // TypeScript. – Access mode: https://www.typescriptlang.org/docs. – Title from screen.

Banks A. Learning React / A. Banks, E. Porcello. – 2nd ed. – Sebastopol : O'Reilly Media, 2020. – 310 p.

WXT Documentation [Electronic resource] // WXT. – Access mode: https://wxt.dev. – Title from screen.

Mantine Documentation [Electronic resource] // Mantine. – Access mode: https://mantine.dev. – Title from screen.

Mitchell R. Web Scraping with Python: Collecting More Data from the Modern Web / R. Mitchell. – 2nd ed. – Sebastopol : O'Reilly Media, 2018. – 306 p.

Readability [Electronic resource] // GitHub. – Access mode: https://github.com/mozilla/readability. – Title from screen.

The Open Graph protocol [Electronic resource]. – Access mode: https://ogp.me. – Title from screen.

Schema.org [Electronic resource]. – Access mode: https://schema.org. – Title from screen.

Gruber J. Markdown: Syntax [Electronic resource] // Markdown Syntax. – Access mode: https://daringfireball.net/projects/markdown/syntax. – Title from screen.

Turndown [Electronic resource] // GitHub. – Access mode: https://github.com/mixmark-io/turndown. – Title from screen.

Mustache: Logic-less templates [Electronic resource] // Mustache. – Access mode: https://mustache.github.io/mustache.5.html. – Title from screen.

Anatomy of an extension [Electronic resource] // MDN Web Docs. – Access mode: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Anatomy_of_a_WebExtension. – Title from screen.

Norman D. A. The Design of Everyday Things / D. A. Norman. – MIT Press Ltd, 2014. – 368 p.
