Using XPaths of Inbound Links to Cluster Template-Generated Web Pages

Tomas Grigalis¹ and Antanas Čenys¹

  1. Department of Information Systems, Vilnius Gediminas Technical University
    Sauletekio av. 11, LT–10223 Vilnius, Lithuania
    {tomas.grigalis, antanas.cenys}@vgtu.lt

Abstract

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages generated from a particular template. Selecting the pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving more than 90% accuracy.
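
As a rough illustration of the idea in the abstract (not the authors' published algorithm), the sketch below groups target pages by the XPath address at which links to them appear on other pages of the same site. The input format, the function names generalize and cluster_by_inbound_xpath, and the rule of stripping positional indices and taking the most frequent address are all assumptions made for this example. Link extraction is assumed to have been done already.

```python
import re
from collections import Counter, defaultdict


def generalize(xpath):
    """Drop positional indices such as [3] so sibling links share one address.

    This normalization is an assumption of the sketch, not taken from the paper.
    """
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_by_inbound_xpath(links):
    """Group target URLs by the generalized XPath of their inbound links.

    links: iterable of (source_url, xpath_of_anchor, target_url) tuples,
    assumed to cover inner-site links only.
    Returns a dict mapping a generalized XPath to the set of target URLs.
    """
    votes = defaultdict(Counter)
    for _source, xpath, target in links:
        votes[target][generalize(xpath)] += 1

    clusters = defaultdict(set)
    for target, counter in votes.items():
        # Pick the most frequent address; break ties alphabetically
        # so the result is deterministic.
        best = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        clusters[best].add(target)
    return dict(clusters)


# Hypothetical link records from one listing page.
links = [
    ("/", "/html/body/div[1]/ul/li[1]/a", "/item/1"),
    ("/", "/html/body/div[1]/ul/li[2]/a", "/item/2"),
    ("/", "/html/body/div[2]/a", "/about"),
]
clusters = cluster_by_inbound_xpath(links)
```

Because each link record is visited once and no page contents are compared pairwise, a grouping of this kind runs in time linear in the number of links, which is consistent with the scalability the abstract emphasizes.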

Key words

Web data extraction, structural clustering, template-generated pages, wrapper induction

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS130416020G

Publication information

Volume 11, Issue 1 (January 2014)
Year of Publication: 2014
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Grigalis, T., Čenys, A.: Using XPaths of Inbound Links to Cluster Template-Generated Web Pages. Computer Science and Information Systems, Vol. 11, No. 1, 111-132. (2014), https://doi.org/10.2298/CSIS130416020G