Discovering web access patterns and trends by applying OLAP and data mining technology on web logs

Conference Paper · May 1998 · DOI: 10.1109/ADL.1998.670376 · Source: IEEE Xplore

Osmar R. Zaïane, Man Xin, Jiawei Han
Virtual-U Research Laboratory and Intelligent Database Systems Research Laboratory
School of Computing Science, Simon Fraser University
Burnaby, BC, Canada V5A 1S6
E-mail: {zaiane, cxin, han}@cs.sfu.ca

Research is supported in part by the Natural Sciences and Engineering Research Council of Canada, TeleLearning-NCE (note: NCE stands for Canadian Networks of Centres of Excellence) and NCE IRIS-2.

Abstract

As a confluence of data mining and WWW technologies, it is now possible to perform data mining on web log records collected from the Internet web page access history. The behaviour of the web page readers is imprinted in the web server log files. Analyzing and exploring regularities in this behaviour can improve system performance, enhance the quality and delivery of Internet information services to the end user, and identify populations of potential customers for electronic commerce. Thus, by observing people using collections of data, data mining can bring considerable contribution to digital library designers. In a joint effort between the TeleLearning-NCE project on Virtual University and the NCE-IRIS project on data mining, we have been developing the knowledge discovery tool, WebLogMiner, for mining web server log files. This paper presents the design of the WebLogMiner, reports the current progress, and outlines the future work in this direction.

1 Introduction

Web servers register a web log entry for every single access they get, in which they save the URL requested, the IP address from which the request originated, and a timestamp. With the rapid progress of World-Wide Web (WWW) technology, and the ever growing popularity of the WWW, a huge number of Web access log records are being collected. Popular web sites can see their web log growing by hundreds of megabytes every day. Condensing these colossal files of raw web log data in order to retrieve significant and useful information is a nontrivial task. It is not easy to perform systematic analysis on such a huge amount of data and therefore, most institutions have not been able to make effective use of web access history for server performance enhancement, system design improvement, or customer targeting in electronic commerce.

However, many people have realized the potential usage of such data. Using web log files, studies have been conducted on analyzing system performance, improving system design, understanding the nature of web traffic, and understanding user reaction and motivation [7, 9, 18, 19]. One innovative study has proposed adaptive sites: web sites that improve themselves by learning from user access patterns [14]. While it is encouraging and exciting to see the various potential applications of web log file analysis, it is important to know that the success of such applications depends on what and how much valid and reliable knowledge one can discover from the large raw log data.

Currently, there are more than 30 commercially available applications for web log analysis and many more are available free on the Internet (the University of Illinois maintains a list of web access analyzers on a HyperNews page accessible at http://union.ncsa.uiuc.edu/HyperNews/get/www/log-analyzers.html). Regardless of their price, most of them are disliked by their users and considered too slow, inflexible and difficult to maintain [17]. We can even argue that most of these web log analysis tools are very limited in the results they can provide. After reviewing over twenty such analysis tools, including Getstats, Analog, Microsoft Intersé Market Focus, and WebTrends, we found that most of the reports of these tools reveal only frequency counts and/or a low conceptual descriptive level. While these types of report are valuable, they are certainly not enough. The most frequent reports pre-defined by web log analysis tools are: a summary report of hits and bytes transferred, a list of top requested URLs, a list of top referrers, a list of the most common browsers used, hits per hour/day/week/month reports, hits per domain report, an error report, a directory tree report, etc. Despite the fact that some of the reports can be customized with some of these tools, the majority of the currently available web log analysis tools have rigid pre-defined reports. Most if not all of these web log analysis tools have limitations with regard to the size of the web log files, whether it is physical in size or practical in time because of the low speed of the analysis. To reduce the size of the log files to analyze, web log analysis tools make assumptions in order to filter out some data like failed requests (i.e., errors) or page graphic requests, or to round off the log entries by combining similar requests. Different assumptions are made for each of the web log analysis tools, resulting in the prospect of different statistics with the same log file. It has been reported in [2] that the analysis of the same web log with different web log analysis tools ended up with different statistical results. Overall, the current web log analysis tools are still limited in their performance, the comprehensiveness and depth of their analyses, and the validity and reliability of their results.

The recent progress and development of data mining and data warehousing has made available powerful data mining and data warehousing systems [6, 4]. Many successful data mining systems can handle very large data files like the web log files. However, we have not seen a systematic study and development of data warehousing and mining systems for mining knowledge from web access log records. Recent research and development of data mining technology have promoted some studies on efficient data mining for user access patterns in distributed systems, referred to as mining path traversal patterns [5]. Understanding user access patterns in a web site using these mining techniques not only helps improve web system design, but also leads to wise marketing decisions (e.g., putting advertisements in proper places, classifying users, etc.). However, as mentioned in [4], mining path traversal patterns is still in its infancy.

In this paper we propose to use data mining and data warehousing techniques to analyze web log records. Based on our experience in the development of a relational database and data warehouse-based data mining system, DBMiner [10], by the Intelligent Database Systems Research Laboratory at Simon Fraser University, and on the development of a Web-based collaborative teaching and learning environment, Virtual-U, by the Virtual-U Research Laboratory at Simon Fraser University, we jointly study the challenging issues of data mining in web log databases, and propose the WebLogMiner system, which performs data mining on web log records collected from web page access history.

The remainder of this paper is organized as follows. In Section 2, we describe the design of a data mining system for web log records. Our implementation efforts and experiments are presented in Section 3. Finally, Section 4 summarizes our conclusions and future enhancements.
2 Design of a Web log miner

The most commonly used method to evaluate access to web resources or user interest in resources is by counting page accesses or "hits". As we will see, this is not sufficient and often not correct. Web server log files of current common web servers contain insufficient data upon which thorough analysis can be performed. However, they contain useful data from which a well-designed data mining system can discover beneficial information.

Web server log files customarily contain: the domain name (or IP address) of the request; the user name of the user who generated the request (if applicable); the date and time of the request; the method of the request (GET or POST); the name of the file requested; the result of the request (success, failure, error, etc.); the size of the data sent back; the URL of the referring page; and the identification of the client agent.

A log entry is automatically added each time a request for a resource reaches the web server. While this may reflect the actual use of the resources on a site, it does not record reader behaviours like frequent backtracking or frequent reloading of the same resource when the resource is cached by the browser or a proxy. A cache would store resources and hand them to a client requesting them without leaving a trace in the log files. Frequent backtracking and reloading may suggest a deficient design of the site navigation, which can be very informative for a site designer; however, this cannot be measured solely from the server logs. Many have suggested other means of data gathering like client-side log files collected by the browser, or a Java Applet. While these techniques solve problems created by page backtracking and proxy caching, they necessitate the user's collaboration, which is not always available. Until the web server log files are enriched with more collected data, our data mining process is solely based on the information currently gathered by the web servers.

Researchers working on web log analysis discredit the use of web access counts as indicators of user interest or measures of the interestingness of a web page. Indeed, as described by Fuller and de Graaff in [7], access counts, when considered alone, can be misleading metrics. For example, if one must go through a sequence of documents to reach a desired document, all documents leading to the final one get their counters incremented even if the user is not interested in them at all. The access counters alone do not account for the user's ability to access the information and the appropriateness of the information to the user. Nonetheless, when access counts are used in conjunction with other metrics, they can help infer interesting findings.

Despite the impoverished state of the server logs, much useful information can be discovered when using data mining techniques. The date and time collected for each successive request can give interesting clues regarding the user interest by evaluating the time spent by users on each resource, and can allow time sequence analysis using different time values: minutes, hours, days, months, years, etc. The domain name collected can allow practical classification of the resources based on countries or type of domain (commercial, education, government, etc.). The sequence of requests can help predict next requests or popular requests for given days, and thus help improve the network traffic by caching those resources, or by allowing clustering of resources in a site based on user motivation. Notwithstanding, the server logs cannot be used as recorded by the web server and need to be filtered before data mining can be applied.

In the WebLogMiner project, the data collected in the web logs goes through four stages. In the first stage, the data is filtered to remove irrelevant information and a relational database is created containing the meaningful remaining data. This database facilitates information extraction and data summarization based on individual attributes like user, resource, user's locality, day, etc. In the second stage, a data cube is constructed using the available dimensions. On-line analytical processing (OLAP) is used in the third stage to drill-down, roll-up, slice and dice in the web log data cube. Finally, in the fourth stage, data mining techniques are put to use with the data cube to predict, classify, and discover interesting correlations.
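The log fields listed earlier in this section can be made concrete with a small parser. The following is a minimal sketch, assuming the NCSA Combined Log Format; the paper does not name a specific log format, and the pattern, the function name `parse_log_line`, and the sample line are all illustrative assumptions.

```python
import re
from datetime import datetime

# Assumption: NCSA Combined Log Format (the paper does not name a format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict with the customary log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields into typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["user"] = None if entry["user"] == "-" else entry["user"]
    return entry

# Hypothetical log line, made up for illustration.
sample = ('192.0.2.7 - - [12/Nov/1997:10:15:32 -0800] '
          '"GET /courses/index.html HTTP/1.0" 200 2048 '
          '"http://example.org/start.html" "Mozilla/4.0"')
entry = parse_log_line(sample)
```

Each named group corresponds to one of the fields enumerated above (requesting host, user name, date and time, method, file requested, result, size, referring page, client agent). Lines that do not match return `None`, so the caller decides whether to drop or keep them.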
2.1 Database construction from server log files: data cleaning and data transformation

The data filtering step is a typical step adopted by many web log analysis tools. While typically web log analysis tools may filter out requests for page graphics (as well as sound and video) in order to concentrate on data pertaining to actual page hits, we tend to keep these entries because we believe they can give us interesting clues regarding web site structure, traffic performance, as well as user motivation. Moreover, one user action can generate multiple server requests, and some of them are requests for page media. Some of these requests are important to deduce the intended action of the user. Another typical cleaning process consists of eliminating log entries generated by web agents like web spiders, indexers, link checkers, or other intelligent agents that pre-fetch pages for caching purposes. We chose not to screen out these requests generated by the web agents. It is often interesting and useful to analyze web agents' behaviour on a site and compare the traffic generated by these automated agents with the rest of the traffic. The data filtering we adopted mainly transforms the data into a more meaningful representation. We tend to consider most of the data are relevant and eliminate a minimal amount of data.

There are two types of data cleaning and data transformation, one that does not necessitate knowledge about the resources at the site and one that does. Cleaning the date and time field of the log entry, for instance, does not need any knowledge about the site itself. The date and time field is simply restructured in a set of fields to specify the day, month, year, hour, minute and seconds. Filtering out server requests that failed or transforming server error codes is also generic. Transforming IP addresses to domain names is independent from the site content as well. However, associating a server request or a set of server requests to an intended action or event clearly necessitates knowledge about the site structure. Moreover, different dynamically generated web pages can be the result of a single script, thus, an identical server request. Knowledge about the parameters provided to the script to generate the dynamic page, or knowledge about the necessary sequence in the request history before a request for a script, can be essential in disambiguating a server request and associating it to an event.

Metadata provided by the site designers is required for the knowledge-based data cleaning and transformation. The metadata consists of a mapping table between a server request (URL with parameters, if available, or a sequence of request URLs) and an event with a representative URL. The transformation process replaces the request sequence by the representative URL and adds the event tag to the log entry.

After the cleaning and transformation of the web log entries, the web log is loaded into a relational database and new implicit data, like the time spent by event, is calculated. The time spent by event (or page) is approximated from the difference between the time the page for the current event is requested and the time the next page is requested, with an upper-bound threshold for the case when the user does not come back to the same server. This notion of time spent is an approximation of the actual perusal duration since it intrinsically includes the time for network transfer, navigation inside the page, etc. It may seem a biased metric but can be very useful comparing pages with the same design.
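The two generic transformations described above, restructuring the date and time field and approximating the time spent per page, can be sketched as follows. This is an illustration under stated assumptions, not the WebLogMiner implementation: the 30-minute threshold is hypothetical (the paper gives no value), the function names are invented, and the threshold is applied both as a cap on every gap and as the value for a user's last request, which is one possible reading of the text.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical upper bound; the paper does not give a value for the threshold.
MAX_TIME_SPENT = timedelta(minutes=30)

def split_timestamp(when):
    """Restructure a timestamp into separate day/month/year/hour/minute/second fields."""
    return {"day": when.day, "month": when.month, "year": when.year,
            "hour": when.hour, "minute": when.minute, "second": when.second}

def add_time_spent(requests, max_spent=MAX_TIME_SPENT):
    """Append an approximate time spent (in seconds) to each (user, timestamp, url) request."""
    ordered = sorted(requests, key=lambda r: (r[0], r[1]))
    result = []
    for user, group in groupby(ordered, key=lambda r: r[0]):
        visits = list(group)
        for i, (_, when, url) in enumerate(visits):
            if i + 1 < len(visits):
                # Gap until the same user's next request, capped by the threshold.
                spent = min(visits[i + 1][1] - when, max_spent)
            else:
                # No later request from this user: fall back to the threshold.
                spent = max_spent
            result.append((user, when, url, spent.total_seconds()))
    return result

# Hypothetical requests, made up for illustration.
start = datetime(1997, 11, 12, 10, 15, 0)
log = [
    ("user1", start, "/index.html"),
    ("user1", start + timedelta(seconds=40), "/notes.html"),
    ("user2", start, "/index.html"),
]
timed = add_time_spent(log)
```

As the text notes, the resulting value includes network transfer and in-page navigation time, so it is best used to compare pages of similar design, not as an absolute reading time.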
2.2 Multi-dimensional web log data cube construction and manipulation

After the data has been cleaned and transformed, a multi-dimensional array structure, called a data cube, is built to aggregate the hit counts. The multi-dimensional data cube has numerous dimensions (i.e., generally more than 3), each dimension representing a field with all possible values described by attributes. For example, the dimension URL may have the attributes: server domain, directory, file name and extension, or the dimension time may have the attributes: second, minute, hour, day, week, month, quarter, year. Attributes of a dimension may be related by partial order indicating a hierarchical relationship among the dimension attributes. Hierarchies are generally pre-defined, but in some cases partitioning the dimension in ranges automatically generates these hierarchies. For example, the dimension file size can be partitioned into size ranges and later grouped into categories like: tiny, small, medium, large, huge.

An n-dimensional data cube, $C[A_1, \ldots, A_n]$, is an n-D database where $A_1, \ldots, A_n$ are n dimensions. Each dimension of the cube, $A_i$, represents an attribute and contains $|A_i| + 1$ rows, where $|A_i|$ is the number of distinct values in the dimension $A_i$. The first $|A_i|$ rows are data rows. Each distinct value of $A_i$ takes one data row. The last row, the sum row, is used to store the summation of the counts of the corresponding columns of the above rows. A data cell in the cube, $C[a_{1,i_1}, \ldots, a_{n,i_n}]$, stores the count of the corresponding generalized tuple of the initial relation, $r(A_1 = a_{1,i_1}, \ldots, A_n = a_{n,i_n}, count)$. In the case of the web log cube, count is for resource hits. A sum cell in the cube, such as $C[sum, a_{2,i_2}, \ldots, a_{n,i_n}]$, where sum is often represented by $*$ or a keyword all, stores $sum_1$, the sum of the counts of the generalized tuples which share the same values for all the second to the n-th columns, i.e., $r(*, A_2 = a_{2,i_2}, \ldots, A_n = a_{n,i_n}, sum_1)$.

Conceptually, a data cube can be viewed as a lattice of cuboids. The n-D space (i.e., base cuboid) consists of all data cells (i.e., no $*$'s in any dimension). The (n-1)-D space consists of all the cells with a single $*$ in any dimension, such as $r(*, a_{2,i_2}, \ldots, a_{n,i_n}, sum_1)$, and so on. Finally, the 0-D space consists of one cell with n $*$'s in all the n dimensions, i.e., $r(*, *, \ldots, *, sum_n)$. A 3-D data cube is shown in Fig. 1.

Figure 1: A 3-D data cube with summary layers (dimensions A1, A2, A3, each with a Sum row)

Notice that since a large number of the cells in the cube can be empty, to handle sparse cubes efficiently, sparse matrix technology should be applied in data cube construction [13, 20]. There have been many methods proposed recently for efficient computation of data cubes, such as [1, 20], which is beyond the scope of this study.

Examples of dimensions in our web log data cube include the following. Notice that each dimension is defined on a concept hierarchy to facilitate generalization and specialization along the dimension.

• URL of the resource, where the concept hierarchy used is defined on the server directory structure;
• type of resource, defined on a pre-built file type hierarchy;
• size of the resource, defined on a range hierarchy;
• time at which the resource was requested, defined on a time hierarchy;
• time spent in page, defined on a range hierarchy of seconds;
• domain name from which the request originated, defined on a pre-built domain hierarchy;
• agent that made the request, defined on a pre-built hierarchy of known web agents and browsers;
• user, defined on a pre-built user hierarchy;
• server status, defined on an error code hierarchy.

The multi-dimensional structure of the data cube provides remarkable flexibility to manipulate the data and view it from different perspectives. The sum cells allow quick summarization at different levels of the concept hierarchies defined on the dimension attributes.

Building this web log data cube allows the application of OLAP (On-Line Analytical Processing) operations, such as drill-down, roll-up, slice and dice, to view and analyze the web log data from different angles, derive ratios and compute measures across many dimensions.

The drill-down operation navigates from generalized data to more details, or specializes an attribute by stepping down the aggregation hierarchy. For example, presenting the number of hits grouped by day from the number of hits grouped by month is a drill-down along the hierarchy time. The roll-up is the reverse operation of the drill-down. It navigates from specific to general, or generalizes an attribute by climbing up the aggregation hierarchy. For example, the aggregation of total requests from group-by organization by day to group-by country by day is a roll-up by summarization over the server domain hierarchy.

The slice operation defines a sub-cube by performing a selection on one dimension by selecting one or some values in a dimension. It is a literal cut of a slice (or slices) on the same dimension. For example, the selection domain = ".edu" on the dimension server domain is a slice on the educational internet domain. The dice operation is a set of consecutive slice operations on several dimensions. It defines a sub-cube by performing selections on several dimensions. For example, a sub-cube can be derived by dicing the web log data cube on four dimensions using the following clause: country = "Canada" and month = 11/97 and agent = "Mozilla" and file type = "cgi".

These OLAP operations assist in interactive and quick retrieval of 2D and 3D cross-tables and chartable data from the web log data cube, which allow quick querying and analysis of very large web access history files.

2.3 Data mining on web log data cube and web log database

On-line analytical processing and the data cube structure offer analytical modeling capabilities, including a calculation engine for deriving various statistics, and a highly interactive and powerful data retrieval and analysis environment. It is possible to use this environment to discover implicit knowledge in the web log database by implementing data mining techniques on the web log data cube. The knowledge that can be discovered is represented in the form of rules, tables, charts, graphs, and other visual presentation forms for characterizing, comparing, associating, predicting, or classifying data from the web access log. These data mining functions are briefly explained as follows.
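Before turning to the individual functions, the data cube of Section 2.2 can be made concrete with a small sketch. It assumes a toy three-dimensional cube (country, month, file type) held in a dictionary, with `"*"` marking a sum cell. All names and sample hits are hypothetical. Concept hierarchies are omitted, so roll-up here only climbs to the "all" level. Materializing every cell this way is for illustration only, not one of the efficient cube computation methods cited above.

```python
from collections import Counter
from itertools import product

ALL = "*"  # marker for a sum cell, as in the definition of the cube

def build_cube(records, n_dims):
    """Count hits for every data cell and every sum cell of the cube."""
    cube = Counter()
    for record in records:
        # Each record contributes to 2^n cells: per dimension, its value or '*'.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if starred else value
                         for value, starred in zip(record, mask))
            cube[cell] += 1
    return cube

def slice_cube(cube, dim, value):
    """Slice: keep the cells whose dimension `dim` equals `value`."""
    return {cell: count for cell, count in cube.items() if cell[dim] == value}

def dice(cube, selection):
    """Dice: consecutive slices; `selection` maps a dimension index to a value."""
    sub = dict(cube)
    for dim, value in selection.items():
        sub = slice_cube(sub, dim, value)
    return sub

# Hypothetical hits as (country, month, file type) tuples.
hits = [
    ("Canada", "11/97", "cgi"),
    ("Canada", "11/97", "html"),
    ("Canada", "12/97", "html"),
    ("USA", "11/97", "cgi"),
]
cube = build_cube(hits, 3)

total_hits = cube[(ALL, ALL, ALL)]             # the single 0-D cell
canada_hits = cube[("Canada", ALL, ALL)]       # rolled up over month and file type
canada_nov = cube[("Canada", "11/97", ALL)]    # drill-down from country to month
canada_cgi = dice(cube, {0: "Canada", 2: "cgi"})
```

Reading a sum cell is a roll-up, and replacing a `"*"` with concrete values is a drill-down. The dice keeps both the base cell and the sum cell that satisfy the selection.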
• Data characterization: This function characterizes data in the web log. It consists of finding rules that summarize general characteristics of a set of user-defined data. The rules are generated from a generalized data cube produced using the web log data cube and the OLAP operations. For example, the traffic on a web server for a given type of media in a particular time of day can be summarized by a characteristic rule.

• Class comparison: Comparison plays the role of examining the Web log data to discover discriminant rules, which summarize the features that distinguish the data in the target class from that in the contrasting classes. For example, to compare requests from two different web browsers (or two web robots), a discriminant rule summarizes the features that discriminate one agent from the other, like time, file type, etc.

• Association: This function mines association rules in the form of "$A_1 \wedge \cdots \wedge A_i \rightarrow B_1 \wedge \cdots \wedge B_j$" at multiple levels of abstraction. For example, one