Scholarworks Repository

A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations

Artail H.; Fawaz K.
dc.contributor.editor 2008 2017-09-07T07:07:20Z 2017-09-07T07:07:20Z 2008
dc.identifier 10.1016/j.datak.2008.04.003
dc.description.abstract This paper describes a fast HTML web page change detection approach that saves computation time by limiting the similarity computations between two versions of a web page to nodes having the same HTML tag type, and by hashing the web page in order to provide direct access to node information. This efficient approach is suitable as a client application and for implementing server applications that could serve the needs of users in monitoring modifications to HTML web pages made over time, and that allow for reporting and visualizing changes and trends in order to gain insight about the significance and types of such changes. The detection of changes across two versions of a page is accomplished by performing similarity computations after transforming the web page into an XML-like structure in which a node corresponds to an open-close HTML tag. Performance and detection reliability results were obtained, and showed speed improvements when compared to the results of a previous approach. © 2008 Elsevier B.V. All rights reserved.
dc.format.extent Pages: (326-337)
dc.language English
dc.publisher AMSTERDAM
dc.relation.ispartof Publication Name: Data and Knowledge Engineering; Publication Year: 2008; Volume: 66; no. 2; Pages: (326-337);
dc.source Scopus
dc.title A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations
dc.type Article
dc.contributor.affiliation Artail, H., Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon
dc.contributor.affiliation Fawaz, K., Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon
dc.contributor.authorAddress Artail, H.; Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon
dc.contributor.authorCorporate University: American University of Beirut; Faculty: Faculty of Engineering and Architecture; Department: Electrical and Computer Engineering;
dc.contributor.authorDepartment Electrical and Computer Engineering
dc.contributor.authorFaculty Faculty of Engineering and Architecture
dc.contributor.authorInitials Artail, H
dc.contributor.authorInitials Fawaz, K
dc.contributor.authorReprintAddress Artail, H (reprint author), Amer Univ Beirut, Dept Elect and Comp Engn, POB 11-0236, Beirut 11072020, Lebanon.
dc.contributor.authorUniversity American University of Beirut
dc.description.citedCount 8
dc.description.citedTotWOSCount 4
dc.description.citedWOSCount 4
dc.format.extentCount 12
dc.identifier.coden DKENE
dc.identifier.scopusID 45249103304
dc.publisher.address PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
dc.relation.ispartOfISOAbbr Data Knowl. Eng.
dc.relation.ispartOfIssue 2
dc.relation.ispartofPubTitle Data and Knowledge Engineering
dc.relation.ispartofPubTitleAbbr Data Knowl Eng
dc.relation.ispartOfVolume 66
dc.source.ID WOS:000258448400007
dc.type.publication Journal
dc.subject.otherAuthKeyword Change monitoring
dc.subject.otherAuthKeyword HTML
dc.subject.otherAuthKeyword Similarity computation
dc.subject.otherAuthKeyword Tree similarity
dc.subject.otherAuthKeyword Web page change detection
dc.subject.otherIndex Calculations
dc.subject.otherIndex HTML
dc.subject.otherIndex Information management
dc.subject.otherIndex Markup languages
dc.subject.otherIndex AND detection
dc.subject.otherIndex change detection
dc.subject.otherIndex client applications
dc.subject.otherIndex Computation time
dc.subject.otherIndex detection approach
dc.subject.otherIndex Direct access
dc.subject.otherIndex Elsevier (CO)
dc.subject.otherIndex gain insight
dc.subject.otherIndex In order
dc.subject.otherIndex Server applications
dc.subject.otherIndex web pages
dc.subject.otherIndex World Wide Web
dc.subject.otherKeywordPlus EFFICIENT
dc.subject.otherKeywordPlus ALGORITHM
dc.subject.otherWOS Computer Science, Artificial Intelligence
dc.subject.otherWOS Computer Science, Information Systems
