Distributed and Collaborative Web Change Detection System

Víctor M. Prieto¹, Manuel Álvarez¹, Víctor Carneiro¹ and Fidel Cacheda¹

  1. Communication and Information Technologies Department
    University of A Coruña, Campus de Elviña s/n - 15071 (A Coruña)
    {victor.prieto, manuel.alvarez, victor.carneiro, fidel.cacheda}@udc.es

Abstract

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable. Identifying the moment when a web page changes as soon as possible and with minimal computational cost is a major challenge. In this article we present the Web Change Detection system that, in a best-case scenario, is capable of detecting, almost in real time, when a web page changes. In a worst-case scenario, it requires, on average, 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. By contrast, current search engines require more than a day, on average, to detect a modification in a web page (in both cases).

Key words

Content refresh, Incremental crawling, Crawling systems and Search engines

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS131120081P

Publication information

Volume 12, Issue 1 (January 2015)
Year of Publication: 2015
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Prieto, V. M., Álvarez, M., Carneiro, V., Cacheda, F.: Distributed and Collaborative Web Change Detection System. Computer Science and Information Systems, Vol. 12, No. 1, 91-114. (2015), https://doi.org/10.2298/CSIS131120081P