DESIGN AND IMPLEMENTATION OF AN INTEGRATED WEB CONTENT MANAGEMENT AND AGGREGATION SYSTEM
Abstract
Introduction. The modern web is an overwhelming source of information for
research, professional and learning activities. Traditional approaches to content preservation –
bookmarks, screenshots, manual copying – have significant drawbacks. Bookmarks store only URLs
that become useless when pages move or disappear. Screenshots sacrifice searchability. Manual
copying is tedious and strips away metadata. The result is fragmented information scattered across
multiple tools, difficult to search and disconnected from its original context. Browser extensions occupy a unique position in this workflow, operating at the moment when users encounter valuable
content.
Purpose. The primary objective is to design and implement a browser extension for structured
web content capture and management, and to analyse the architectural solutions that ensure effective
extraction, processing and storage of information from web pages.
Results. The Digital Hub browser extension implements a multi-stage content extraction
pipeline that performs semantic analysis of web pages, removal of irrelevant elements and conversion
to Markdown format. The pipeline prioritises semantic CSS selectors for content detection over
scoring-based approaches, which makes it faster on modern HTML5-compliant sites than
scoring-based solutions such as Mozilla Readability. A template system with five variable types (common, meta,
schema, selector and LLM prompt), filter chains for value transformation, and trigger patterns for
automatic template selection enables user-defined extraction rules for different content types. The
extension's architecture, built on five isolated execution contexts connected by a typed messaging
protocol, ensures type safety at compile time. A comparative analysis of existing approaches
demonstrates that no current solution combines automatic structured extraction with a flexible
template system and built-in search.
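The filter chains and trigger patterns mentioned above can be illustrated with a minimal TypeScript sketch. It is not the extension's actual code: the names Filter, filters, renderVariable, Template and selectTemplate, the "name|filter|filter" expression syntax, and the three sample filters are all assumptions made for illustration.

```typescript
// Hypothetical sketch: none of these names come from the Digital Hub source.
type Filter = (value: string) => string;

// Illustrative filter registry; the real extension may offer a different set.
const filters: Record<string, Filter> = {
  trim: (v) => v.trim(),
  lower: (v) => v.toLowerCase(),
  // Collapse runs of non-alphanumeric characters into single dashes.
  slug: (v) => v.replace(/[^a-z0-9]+/gi, "-").replace(/^-+|-+$/g, ""),
};

// Resolve an expression such as "title|trim|lower" against captured values.
// The pipe-separated syntax is an assumption, not the documented format.
function renderVariable(expr: string, values: Record<string, string>): string {
  const parts = expr.split("|").map((part) => part.trim());
  const name = parts[0] ?? "";
  const chain = parts.slice(1);
  const initial = values[name] ?? "";
  // Apply each filter in order, feeding the output of one into the next.
  return chain.reduce((acc, filterName) => {
    const filter = filters[filterName];
    if (!filter) throw new Error(`Unknown filter: ${filterName}`);
    return filter(acc);
  }, initial);
}

interface Template {
  name: string;
  triggers: RegExp[]; // URL patterns that activate this template
  noteName: string;   // variable expression used for the note title
}

// Pick the first template whose trigger matches the page URL,
// falling back to a default template when none does.
function selectTemplate(url: string, templates: Template[], fallback: Template): Template {
  return templates.find((t) => t.triggers.some((re) => re.test(url))) ?? fallback;
}
```

Under these assumptions, a template with the trigger /^https:\/\/example\.com\/blog\// would be selected for any page under that path, and the expression "title|trim|lower|slug" would turn a captured page title into a file-name-friendly note name.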
Conclusion. The combination of automated extraction with user-defined processing rules offers
an effective compromise between fully manual capture and fully automatic systems. The system
provides a functional foundation for personal knowledge management workflows, transforming
ephemeral web pages into structured, searchable notes. Further development directions include
expanding output format support and improving the content scoring algorithm for non-standard page
layouts.
This work is licensed under a Creative Commons Attribution 4.0 International License.