DOI: 10.2298/CSIS100322028W

Research on Discovering Deep Web Entries

Ying Wang1,2, Huilai Li3, Wanli Zuo1,2, Fengling He1,2, Xin Wang1,4 and Kerui Chen1,2

  1. College of Computer Science and Technology, Jilin University
    130012 Changchun, China
  2. Key Laboratory of Computation and Knowledge Engineering,
    Ministry of Education, China
    {wangying2010, zuowl, hefl}@jlu.edu.cn, Chenke0616@163.com
  3. College of Mathematics, Jilin University
    130012 Changchun, China
    lihuilai@jlu.edu.cn
  4. College of Software, Changchun Institute of Technology,
    130012 Changchun, China
    wangxccs@126.com

Abstract

Ontology plays an important role in locating domain-specific Deep Web content. This paper therefore presents WFF, a novel framework that efficiently locates domain-specific Deep Web databases using focused crawling and ontology. WFF arranges three classifiers in a hierarchy: a Web Page Classifier (WPC), a Form Structure Classifier (FSC), and a Form Content Classifier (FCC). First, WPC discovers potentially interesting pages with an ontology-assisted focused crawler. Then, FSC analyzes those pages and determines, from structural characteristics, whether they contain searchable forms. Finally, FCC identifies the searchable forms that belong to a given domain at the semantic level and stores the URLs of these domain-specific searchable forms in a database. A detailed experimental evaluation shows that WFF not only simplifies the discovery process but also effectively identifies domain-specific databases.
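
The three-stage hierarchy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the classifiers in the paper are ontology-based, whereas the stand-ins here use simple keyword overlap and a crude structural rule. All names (`FormExtractor`, `wpc_is_relevant`, `fsc_is_searchable`, `fcc_in_domain`, `find_domain_forms`), the thresholds, and the flat set of domain terms standing in for an ontology are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Form:
    """Minimal view of one HTML form: control types and words found inside it."""
    control_types: list = field(default_factory=list)
    words: set = field(default_factory=set)


class FormExtractor(HTMLParser):
    """Collects every word on the page and a Form record per <form> element."""

    def __init__(self):
        super().__init__()
        self.page_words = set()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = Form()
        elif self._current is not None:
            if tag == "input":
                # A missing or empty type attribute defaults to a text field.
                self._current.control_types.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.control_types.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        words = set(re.findall(r"[a-z]+", data.lower()))
        self.page_words |= words
        if self._current is not None:
            self._current.words |= words


def wpc_is_relevant(page_words, domain_terms, threshold=2):
    """WPC stand-in (assumption): enough domain terms appear on the page."""
    return len(page_words & domain_terms) >= threshold


def fsc_is_searchable(form):
    """FSC stand-in (assumption): a query-like control and no password field."""
    has_query_control = any(t in ("text", "search", "select") for t in form.control_types)
    return has_query_control and "password" not in form.control_types


def fcc_in_domain(form, domain_terms):
    """FCC stand-in (assumption): the form's own text mentions a domain term."""
    return bool(form.words & domain_terms)


def find_domain_forms(pages, domain_terms):
    """Runs WPC, then FSC, then FCC, and returns URLs that pass all three."""
    hits = []
    for url, html in pages.items():
        extractor = FormExtractor()
        extractor.feed(html)
        if not wpc_is_relevant(extractor.page_words, domain_terms):
            continue  # rejected at the page level, forms are never inspected
        if any(fsc_is_searchable(f) and fcc_in_domain(f, domain_terms)
               for f in extractor.forms):
            hits.append(url)
    return hits
```

The ordering mirrors the hierarchy in the abstract: the cheap page-level filter runs first, so the form-level checks are applied only to pages that already look relevant. In the paper, the collected URLs are stored in a database; the sketch simply returns them as a list.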

Key words

Deep Web, ontology, WPC, FSC, FCC

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS100322028W

Publication information

Volume 8, Issue 3 (June 2011)
Year of Publication: 2011
ISSN: 1820-0214 (Print) 2406-1018 (Online)
Publisher: ComSIS Consortium

Full text

Available for download in PDF (Portable Document Format)

How to cite

Wang, Y., Li, H., Zuo, W., He, F., Wang, X., Chen, K.: Research on Discovering Deep Web Entries. Computer Science and Information Systems, Vol. 8, No. 3, 779-799. (2011)