A New Recognition Approach for Logical Link Blocks in Webpages

X.M. WANG 1, 2, Z.D. WU 1, Y.N. HUANG 3 4, 5*, Q. GU
1 Oujiang College, Wenzhou University, Wenzhou
2 Network Research Institute of Wenzhou, Wenzhou, Zhejiang, China
3 kloudsmart, Inc., 1175 Eagle Cliff Court, San Jose, CA 95120, U.S.A.
4 School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang, Hubei, China
5 Institute of Logic and Intelligence, Southwest University, Chongqing, China

Journal of Digital Information Management

ABSTRACT: The link block is a block structure that exists widely in webpages. Existing approaches to link block recognition generally suffer from two drawbacks: 1) they are designed only for link blocks of physical structure, and often only for specific link blocks built from block-level elements; and 2) the discovery and recognition of link blocks are based on analyzing HTML tag trees, which often leads to high computing cost and makes them fail on the diversified nonstandard webpages of the Internet. To this end, in this paper we propose the concept of logical link blocks and then present an effective approach to discover and recognize logical link blocks in webpages. In the approach, logical link blocks are recognized by scanning the HTML code and calculating the distance between adjacent links, and then two distance thresholds are used to determine the final logical link blocks. As a result, the approach is not only free from the limits of specific block-level link blocks, but also greatly improves robustness, because the analysis of HTML tag trees is no longer required. Finally, experimental results demonstrate the effectiveness of the proposed approach, which not only provides a new way for the recognition of logical link blocks and for text extraction, but can also be applied in other web information processing and mining fields because it is less demanding about the particle size control of link blocks.
Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis; I.7 [Document and Text Processing]

General Terms: Information Extraction, Experimentation

Keywords: Web, Link block, Logical Link Block, Recognition

Received: 18 November 2014, Revised 22 December 2014, Accepted 27 December

Journal of Digital Information Management, Volume 13, Number 2, April 2015

1. Introduction

The World Wide Web is a super-large complex network constructed from a variety of links between webpages. Thus, links play a key role in web information organization, display, page navigation and so on. A Web crawler traverses the Internet based on the links between webpages, and Internet users rely on the links between webpages to read clusters of content on the same topic. Links in webpages are usually organized according to different particle sizes. The finer the particle size of link blocks, the higher the topic correlation of the links; that is, as the block particle size increases, the topic cohesion of link blocks gradually weakens. As shown in Figure 1, the page can be classified into three link blocks when the required particle size is fine. However, it would be regarded as one link block when the required particle size is not fine, and the use of the whole link block is navigation.

Figure 1. Particle size of link blocks

In relevant studies of link blocks, the required fineness of link blocks changes with the research objective. In the analysis of link blocks, the required particle size is usually fine, for example, in the link extraction of a specific topic. However, in other unspecialized studies on link blocks, the particle size requirement is not high, for example, in the text extraction of webpages. In terms of technological implementation, visual blocking usually corresponds to some block-level HTML tag elements [1] such as <div> and <table>. Consequently, existing studies on link blocks focus mainly on this situation. However, considering the variety of web design technologies and implementations, visual blocking is not always realized by block-level HTML tags.
Instead, it may be realized by inline HTML tags [1]. Therefore, the implementation model of link blocks used by designers cannot be known in advance. It has to be determined by elaborate analysis of HTML tag attributes, which brings a lot of trouble to automated applications based on massive web data.

The rest of this paper is organized as follows. Section 2 describes related work already done in this field. Section 3 presents an approach to discriminate and recognize logical link blocks. In Section 4, experimental results of the approach are reported and discussed. Section 5 concludes the paper.

2. Relevant Studies and Problems

Studies on link blocks have a long history, and many webpage blocking or extraction methods have been proposed. In [2], the webpage extraction methods were divided into 5 categories, i.e., 1) wrappers for content extraction, 2) template detection for extracting content, 3) content extraction using machine learning, 4) content extraction using visual cues and/or assumptions and 5) content extraction based on HTML features and/or statistics. These five categories can also be applied to the blocking of webpage link blocks. Among them, the universality of wrappers for content extraction and of template detection for extracting content is poor, and they usually need manual work and timely renewal and maintenance, which makes them time-consuming and labor-intensive. Considering these factors, some researchers have proposed wrapper algorithms without template support or human supervision and have achieved good results [3-5]. Content extraction using machine learning needs proper training sets and appropriate features [6], so it is difficult to make it work without human supervision. VIPS [7] is a typical method of content extraction using visual cues and/or assumptions. It has high accuracy, but its requirement of webpage analysis is too fine and the calculation is time-consuming. When it is used to deal with numerous non-standard webpages, its accuracy and robustness are difficult to guarantee.
Moreover, if the widely used CSS [8] is adopted to control the visual presentation of HTML tags, the relevant CSS still needs to be analyzed, which finally leads to huge analysis tasks and a lack of program robustness. Content extraction based on HTML features and/or statistics mostly relies on heuristic rules [9,12,15,23] or statistical laws, whose universality needs improvement. Besides, researchers have also put forward some other methods, for example, a fuzzy-neural network [10] for page blocking and the MSS [11] page blocking method.

Although the relevant methods are various and have their respective features, we can conclude that the relevant algorithms of link block recognition are basically based on HTML tag trees [12-16,21,22] or DOM [17]. Other methods are also generally based on HTML tag trees [18,19]. Furthermore, some of the relevant studies of webpage blocking are specific to block-level HTML tags only, for example <div> and <table>. Because of the diversity and power of table functions [20], tables were almost indispensable for webpage layout, modification and content organization in the early stage, and correspondingly some studies only consider webpages with table layout and fail to distinguish tables for layout from tables for content organization [21]. For table-designed webpages, Son [21] distinguished and identified the two functions of tables. Experiments demonstrated the effectiveness of the proposed method. However, a processing mode specific to tables only is too limited. In present webpage design, tables basically coexist with div. Uzun [22] considered these two conditions, first obtained the blocking information according to div and td, and then achieved good effects by combining it with decision trees to generate extraction rules, in particular obtaining extraction speed equivalent to that of manual rules. Wang [23] proposed the BSU concept and, based on it, adopted clustering and heuristic rules to realize page information extraction, resulting in better performance than div-based and table-based methods.
Current blocking algorithms for link blocks, especially the various HTML tag tree-based methods, require webpages to obey strict standards. These strict standards include both HTML tag grammatical norms (for example, the pairing of opening and closing HTML tags) and norms of semantic design (for example, if users see block-shaped content in the browser, the corresponding code usually consists of block-level tags such as <div> and <table>; if users see titles, the corresponding code usually consists of h1, h2 and other tags with title significance). In fact, among massive webpages, quite a few do not observe HTML tag grammatical norms and semantic design norms. Although non-standard HTML tag grammar can be corrected by some existing webpage standardization programs, the accuracy is difficult to guarantee. Violations of semantic design norms are even harder to correct. This determines that the various methods based on HTML tag trees can achieve good results only in two kinds of environment, namely standard webpages and nonstandard webpages that are easy to correct; for haphazard webpages, they will fail to work.

Many existing relevant studies on webpage processing generally regard the code block corresponding to a block-level HTML tag as a block. This processing mode can greatly improve the processing of massive webpages, but in the face of numerous and complicated webpages, it may bring about two consequences, i.e., misjudgment or detection error. For example, many webpages contain non-block-level advertisements. In the research field of webpage text extraction, these advertisement links cannot be detected by the traditional block-level mode, as shown in Figure 2.

Figure 2. Non-block-level embedded advertisement links

3. Methods and Principles

For the convenience of the following discussion, this paper first defines the following concepts. In the webpage code, there are two kinds of distances, which are named code distance and text distance respectively.
Definition 1 (Code distance). The code distance between two arbitrary HTML tags refers to the length of all content between the HTML tag end mark of the former HTML tag, >, and the HTML tag starting mark of the latter, <. In the calculation of this paper, the attributes of each HTML tag are removed first and then the code distance is calculated; for example, after the removal of HTML tag attributes from <A id=m>abc</a>, we obtain <A>ABC</A>.

Definition 2 (Text distance). The text distance between two arbitrary HTML tags refers to the length of all text between the HTML tag end mark of the former HTML tag, >, and the HTML tag starting mark of the latter, <. The calculation of text distance obeys the following rules.
(1) English and other alphabetic text takes a word as the statistical unit, namely, the length of one word is counted as 1.
(2) Chinese and other such characters take one single character as the statistical unit, namely, the length of one Chinese character is counted as 1.
(3) A number takes one complete number as the statistical unit, namely, the length of one complete number is counted as 1. For example, the length of "Bei Jing 2008" is counted as 3.
(4) A date string takes the whole date as the statistical unit, namely, the length of a complete date is counted as 1. For example, the length of "March 8th, 2014" is counted as 1.
(5) The statistical rule for punctuation marks is the same as that for Chinese characters. If several adjacent punctuation marks are the same, the length is counted as 1.

Definition 3 (Link distance). It refers to the distance between two adjacent links in a webpage. Link distance can be measured by code distance or text distance.
Code distance: the code distance between the </a> of the former link and the <a> of the latter link.
Text distance: the text distance between the </a> of the former link and the <a> of the latter link.

Definition 4 (Logical block). It refers to a continuous code area constituted by at least one HTML tag, or by several adjacent or nearby HTML tags. A logical block may be a single HTML tag block or a combined block constituted by several adjacent or nearby HTML tag blocks, and each HTML tag included in the logical block is required neither to be complete nor to be a block-level HTML tag.
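The counting rules of Definitions 1 to 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions: the names `text_distance`, `strip_attributes` and `link_distances` are hypothetical, only two date formats are recognized, "Chinese and other characters" is narrowed to the CJK Unified Ideographs range, whitespace is counted in the code distance, and tags are parsed with simple regular expressions that would fail on attribute values containing ">".

```python
import re

# Assumed, deliberately narrow date formats: "March 8th, 2014" and "2014-03-08".
_MONTHS = ("January|February|March|April|May|June|July|August|"
           "September|October|November|December")
_DATE = (rf"(?:{_MONTHS})\s+\d{{1,2}}\s*(?:st|nd|rd|th)?,?\s+\d{{4}}"
         r"|\d{4}-\d{1,2}-\d{1,2}")
_CJK = r"\u4e00-\u9fff"  # assumption: CJK Unified Ideographs only

_TOKEN = re.compile(
    rf"(?P<date>{_DATE})"                        # rule 4: a whole date counts 1
    r"|(?P<number>\d+(?:\.\d+)?)"                # rule 3: a whole number counts 1
    rf"|(?P<cjk>[{_CJK}])"                       # rule 2: each character counts 1
    rf"|(?P<word>(?:(?![{_CJK}])[^\W\d_])+)"     # rule 1: each word counts 1
    r"|(?P<punct>(?P<mark>[^\w\s])(?P=mark)*)"   # rule 5: a run of equal marks counts 1
)

_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")
_ANY_TAG = re.compile(r"<[^>]*>")
_GAP = re.compile(r"</a>(.*?)<a>", re.IGNORECASE | re.DOTALL)


def text_distance(text: str) -> int:
    """Count statistical units in plain text following rules (1) to (5)."""
    return sum(1 for _ in _TOKEN.finditer(text))


def strip_attributes(html: str) -> str:
    """Drop the attributes of every tag, e.g. <a href="/x"> becomes <a>."""
    return _TAG.sub(r"<\1\2>", html)


def link_distances(html: str):
    """Return (code distance, text distance) for each pair of adjacent links."""
    gaps = _GAP.findall(strip_attributes(html))
    return [(len(gap), text_distance(_ANY_TAG.sub(" ", gap))) for gap in gaps]


page = ('<a href="/x">One</a> <span class="s">Bei Jing 2008</span> '
        '<a href="/y">Two</a>')
print(text_distance("Bei Jing 2008"))    # 3
print(text_distance("March 8th, 2014"))  # 1
print(link_distances(page))              # [(28, 3)]
```

In the sample page, the code between the two links is ` <span>Bei Jing 2008</span> ` once attributes are removed, which is 28 characters long, while the text between them contains two words and one number, giving a text distance of 3.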
As shown in Figure 3, A and B are two adjacent HTML tags, so they constitute a logical block. A1 and A2 are adjacent child HTML tags of A, so they also constitute a logical block. A2 and B1 belong to different parent HTML tags, but A2 and B1 are adjacent. Through the latter half of the code of A and the former half of the code of B, A2 and B1 form a continuous code region. Thus, they also constitute a logical block.

Figure 3. Logical block

Definition 5 (Logical link block). Suppose the number of links in one logical block is denoted $C_{link}$, and the distances between adjacent links in the logical block are denoted $(d_1, d_2, \ldots, d_{C_{link}-1})$. If the logical block satisfies the following conditions, it is called a logical link block.

$$C_{link} \ge C, \qquad \max(d_i) \le d$$

Here, C represents the minimum link number in link blocks, and d represents the maximum distance between adjacent links. This means that, to be a logical link block, the link number shall be no less than C and the distance between adjacent links shall be no more than d.

The discovery of logical link blocks can be realized by scanning the webpage code from front to back and calculating the distance between adjacent links as they are discovered. While the distance does not exceed the threshold d, record the link number and continue to scan until the distance between adjacent links exceeds d. Then judge whether the accumulated link number is no less than C. If so, a logical link block has been discovered. In either case, the discovery of the current block ends and the discovery of the next link block starts. The advantage of this discovery approach is that it does not rely on HTML tag trees, which means there is no need to spend massive computing resources on the analysis of HTML tag trees. This further avoids the various problems of analyzing numerous and complicated non-standard code.

At present, there is no available evaluation index for the recognition results of logical link blocks. This paper proposes two indexes, i.e., link coverage ratio (LCR) and code coverage ratio (CCR).
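The front-to-back scan described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `find_link_blocks` and `link_coverage_ratio` are hypothetical, the input is assumed to be the precomputed list of distances between adjacent links of a page with at least one link, and the thresholds follow Definition 5 (a block needs at least C links and no adjacent distance above d). The link coverage ratio is computed as the share of the page's links that fall inside recognized blocks.

```python
from typing import List, Tuple


def find_link_blocks(distances: List[int], min_links: int,
                     max_dist: int) -> List[Tuple[int, int]]:
    """Single pass over the link distances of one page.

    distances[i] is the distance between link i and link i + 1, so the page
    has len(distances) + 1 links. Returns (first_link, last_link) index pairs
    of every run with at least min_links links and no gap above max_dist.
    """
    blocks = []
    start = 0  # index of the first link in the current run
    for i, dist in enumerate(distances):
        if dist > max_dist:
            # The gap after link i is too wide: the current run ends at link i.
            if i - start + 1 >= min_links:
                blocks.append((start, i))
            start = i + 1
    last = len(distances)  # index of the final link on the page
    if last - start + 1 >= min_links:
        blocks.append((start, last))
    return blocks


def link_coverage_ratio(blocks: List[Tuple[int, int]], total_links: int) -> float:
    """Share of all links on the page that lie inside recognized blocks."""
    covered = sum(end - start + 1 for start, end in blocks)
    return covered / total_links


# Nine links with eight gaps; thresholds C = 3 and d = 10 (illustrative values).
gaps = [2, 3, 50, 1, 40, 4, 4, 4]
blocks = find_link_blocks(gaps, min_links=3, max_dist=10)
print(blocks)                                        # [(0, 2), (5, 8)]
print(link_coverage_ratio(blocks, len(gaps) + 1))    # 7 of 9 links covered
```

Links 3 and 4 are close to each other but form a run of only two links, so with C = 3 they are left outside any block, which is why seven of the nine links are covered.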
$$\mathrm{LCR} = \frac{C_{BlockLinks}}{C_{PageLinks}}, \qquad \mathrm{CCR} = \frac{L_{Block}}{L_{Page}}$$

Here, $C_{BlockLinks}$ represents the total link number in the logical link blocks that have been recognized; $C_{PageLinks}$ represents the total link number in the webpage; $L_{Block}$ represents the total code length of the logical link blocks; and $L_{Page}$ represents the code length of the webpage.

4. Experiment Design and Result Analysis

4.1 Experimental Objectives

The following experiments aim at verifying the validity of the above-mentioned recognition approach for logical link blocks and at exploring the effects and characteristics of this approach when processing index-type and content-type webpages.

4.2 Experimental Schemes

The original web data in the following experiments were crawled randomly from the Internet by a program, and two modes were adopted for sampling.
1) Manual screening. The manually screened webpage data come from 20 well-known web portals, such as Netease, Sina, Chinanews, etc. From each website, 16 index webpages (a portal's home page) and 40 content webpages (a detail page about one subject, such as a news page, a blog page, a video display page, etc.) were selected, 1,120 pages in total.
2) Random drawing. There are 184 index webpages and 1,024 content webpages that were drawn randomly, 1,208 in total.

Because webpage text extraction may be the most promising application of logical link blocks, as many different types of webpages as possible were selected when screening content webpages, including both long and short articles, and both pure-text pages and pages with pictures and videos.

The experiments are divided into two groups. Each group adopts code distance and text distance in turn as the distance measure between links, to test the link block recognition of index webpages and content webpages under different parameter configurations. For the convenience of the following discussion, the link distance threshold based on text distance is denoted $d_t$ and the link distance threshold based on code distance is denoted $d_c$. The experimental parameter configurations of the two groups of experiments are as follows.

First group, suppose C = 3.
In the case of text distance, d = {5, 10, ..., 60}; in the case of code distance, d = {10, 20, ..., 120}.

Second group, suppose C = {2, 3, ..., 12}. In the case of text distance, d = 40; in the case of code distance, d =

4.3 Experimental Results and Analysis

1) Influences of d on webpage link blocks

Figure 4. Influences of the link distance threshold, d, on logical link blocks (index webpages)

For any webpage, it is not difficult to imagine that, as the distance threshold d increases, adjacent links are more easily included in the same logical link block and the logical link blocks become larger. For a given total number of links, the number of logical link blocks becomes smaller and, correspondingly, the total links and code areas covered by the link blocks increase, namely, LCR and CCR become higher. The experimental data in Figure 4 confirm this. The subscripts m, r and a in each figure represent the manual group, the random group and the corresponding index over all data, respectively. It can be seen from Figure 4 that:
(1) Index webpages contain numerous links, but pure text in index webpages is extremely scarce. In areas with no pure text or with extremely short text intervals, all links are included in the same logical link block. Thus, when text distance is taken as the link distance, the logical link blocks are extremely few, especially when d increases.
(2) The manual sampling data are from portals with large webpage volume and complex structure. Because of the abundant information and content presented and the complex columns, links are numerous. Most webpages in the random group are of normal size, present relatively little content and have simple columns. Thus, their links are few. Furthermore, because webpages in the random group are relatively small and long texts are few, their logical link blocks are obviously fewer than those in the manual group.
(3) In index webpages, when d = 5, LCR is over 90%, indicating that there are few pure texts with a length of over 5 in the index webpages. This is a well-known condition.
Namely, links are spread all over index webpages, but pure text hardly is.
(4) When d is above about 20, there is a significant difference in CCR between the manual group and the random group, while the difference between the LCR curves is relatively small. The reasons are as follows. Once LCR has risen to a high level, isolated links or link blocks become fewer. If d increases further, its major function is to combine the small logical link blocks separated by some long texts into larger link blocks, instead of bringing isolated links or link blocks into logical link blocks to increase LCR. This manifests as the phagocytosis of code other than links. In the merging process, logical link blocks become fewer and the original middle areas between logical link blocks are wholly brought into the new logical link blocks. Although no or few new links, which could increase LCR, are brought into logical link blocks in this process, the inclusion of the code in the middle areas between logical blocks can significantly increase CCR.
(5) Comparing the LCR curve and the CCR curve, it can be seen that LCR basically remains unchanged once d reaches about 20, while CCR remains unchanged once d reaches about 45. This means that in index webpages, for d between about 20 and 45, the major contribution of an increase of d is the phagocytosis of non-link code; for d below about 25, an increase of d swallows links and the code between links at the same time, which manifests as the synchronous rise of LCR and CCR.
(6) Comparatively speaking, the logical link blocks in the random group are more easily influenced by d. The major reasons are as follows. First, the webpages in the random group have few links on the whole, usually between dozens and hundreds, while portal webpages in the manual group usually contain thousands of links. A smaller cardinality of links is more easily influenced. Second, pure text in the webpages of the random group is extremely scarce, so an increase of d can rapidly cluster originally small logical link blocks into larger ones. Thus, the number of logical link blocks decreases dramatically, which makes the fluctuation of logical link blocks in the random group more obvious.
Compared with index webpages, the experimental results of content webpages show a significant difference.
(1) The number of logical link blocks decreases significantly, which is mainly caused by the different functions of content webpages and index webpages. Index webpages undertake the navigation function and include as many links as possible to carry as much information as possible, while content webpages focus on content about one topic, which may be text, pictures or video. These topic elements occupy a large amount of space, so the number of links decreases dramatically, which in turn leads to a dramatic decrease in the number of final logical link blocks. When d is large enough, the number of logical link blocks in webpages basically stays between 2 and 3, among which mostl
