Block addressing indices for approximate text retrieval

Ricardo Baeza-Yates, Gonzalo Navarro
Department of Computer Science
University of Chile
Blanco Encalada 2120 - Santiago - Chile
{rbaeza,gnavarro}@dcc.uchile.cl

Abstract

The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limited number of differences between the words found and those sought. This is called "approximate text searching", which is becoming more and more popular. In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these "block addressing" indices, and even less is known when it comes to cope with approximate search.

Our main contribution is an analytical study of the space-time trade-offs for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance figures. These results are valid for classical exact queries as well as for approximate searching.

We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search times.

1 Introduction

One of the most outstanding characteristics of modern textual databases is their impressive size. Special-purpose collections such as TREC [18] contain hundreds of gigabytes in their latest versions. The Web, a gigantic ad-hoc distributed text collection, was estimated to hold more than 1 terabyte in 1998 [5]. Handling text collections of these sizes and being able to efficiently search on them is becoming a complex task [5, 36].

First of all, it is impossible to sequentially search the whole text for the user-specified strings of interest. Even the fastest sequential algorithms would need from minutes to hours to answer the simplest queries, not to mention the extra problems of a distributed text connected by relatively slow links, like large portions of the Web. Specialized data structures built on the text, called indices, are used to speed up the search. This causes a space problem, since not only the text has to be stored but also its index, which typically needs from 20% to 200% extra space over the text size.

This work has been supported in part by Fondef (Chile) grant 96-1064.

However, this is not the only problem. Many text databases, like the Web, are heterogeneous and error-prone, since they store data from different sources and there is little or no quality control on them. Errors coming from misspelling, mistyping or from optical character recognition (OCR) are examples of agents that inevitably degrade the quality of large text databases. Words which are stored erroneously are no longer retrievable by means of exact queries. Moreover, even the queries may contain errors, for instance in the case of misspelling a foreign name or missing an accent mark.

The last years witnessed different improvements on query flexibility, some of them regarding the type of patterns that can be searched for. From tools as simple as case sensitiveness to the ability to search regular expressions and to allow some errors in the matches, virtually no current commercial text index is limited to simply searching for exact patterns. Of course, searching such "extended" patterns is harder than the classical search for simple ones, which demands new indexing techniques to speed up this task.

Some indices dealing with the space problem and with more flexible searching have appeared in recent years, Glimpse
[25] probably being the best known exponent. Most of them are modified inverted files, based on the concept of block addressing: the text is logically split in blocks and the index is able to tell which blocks a word appears in, but not its exact positions. Despite their existence as software systems, little is known about the expected behavior of block addressing indices. Even less is known about their performance regarding searching for extended patterns.

In this work, we study the use of block addressing to obtain indices which are sublinear in space and in query time simultaneously, and show analytically a range of valid block sizes to achieve this. The combined sublinearity means that, as the text grows, the space overhead of the index and the time to answer a query become less and less significant as a proportion of the text size. This surprising result applies to classical queries as well as to queries that allow errors in the matches, which is one of the most important classes of extended patterns. Ours is an average case analysis which gives "big-O" (i.e. growth rate) results and is strongly based on some heuristic rules widely accepted in Information Retrieval (IR). We validate these analytical results with extensive experiments, obtaining typical performance figures.

Finally, we apply our analysis to an interesting particular case of document addressing (where the documents are of variable size): we use recently obtained statistics from the distribution of the page sizes in the Web [13] and apply our machinery to determine the space overhead and retrieval time of an index for a collection of Web pages. We show that having documents of different sizes reduces space requirements in the index but increases search times if the documents have to be traversed.

As a side result, we prove new relations between some laws of Information Retrieval which were previously unnoticed. We also study experimentally how many different words from a text match a query when errors are allowed in the matches, and conjecture that a similar rule is followed by other extended patterns.

This paper is organized as follows. In Section 2 we explain the issue of text searching allowing errors. In Section 3 we present inverted files and the concept of block addressing. In Section 4
we study analytically the space-time trade-offs related to the block size. In Section 5 we validate experimentally the analysis. In Section 6 we apply our analysis to the Web statistics. Finally, in Section 7 we give our conclusions and future work directions. A previous partial version of this work appeared in [3].

2 Text Searching Allowing Errors

One of the main goals when searching for extended patterns is to find words whose exact spelling is not known. This encompasses also those text words which are incorrectly written. The problem of correcting misspelled words in written text is rather old. We could find references from the twenties [26], and perhaps there are older ones. By the sixties, a number of ad-hoc models to match incorrectly written versions of a word existed, like those of Blair [10], Damerau [14] and the popular Soundex method, described for instance in [21, 17]. However, some time elapsed until it was realized [34] that such ad-hoc models were inferior^1 to simple variants of the so-called Levenshtein (or edit) distance [23, 24].

The edit distance between two strings is defined as the minimum number of character insertions, deletions and replacements needed to make them equal. For example, the edit distance between "color" and "colour" is 1, while between "survey" and "surgery" is 2. Phonetic issues can be incorporated in this distance, by changing the cost of the different operations. The goal is, then, to find the text words which are close (in the sense of edit distance) to a given pattern.

More formally, the problem of approximate string matching (or string matching allowing errors) is defined as follows: given a text (of size n) and a pattern, retrieve all the segments (or "occurrences") of the text whose edit distance to the pattern is at most k (the number of allowed "errors"). This problem has a number of other applications in computational biology, signal processing, etc.

There exist a number of solutions for the on-line version of this problem [31] (i.e. the pattern can be preprocessed but the text cannot). All these algorithms traverse the whole text sequentially. If the text database is large, even the fastest on-line algorithms are not practical, and preprocessing
the text becomes mandatory. This is normally the case in IR. However, the first indexing schemes for this problem are only a few years old.

There are two types of indexing mechanisms that address this problem: sequence-retrieving and word-retrieving. The first ones can retrieve every matching substring of the text, and they do not rely on the concept of word. This makes them suitable for applications such as genetic databases. However, the existing indices are rather immature. Almost all are prototypes in an experimental stage and unable to handle large volumes of text. The indices typically take four to twelve times the size of the text. Some examples are [12, 16, 22, 30, 32, 33].

Word-retrieving indices, although less general, are better suited to natural language applications and IR. They rely on the concept of word, and are able to retrieve every word whose edit distance to the pattern is at most k. This simplification allows the existence of very efficient implementations. As all of them are inverted files with a modified search technique, we defer their discussion to the next section.

^1 Semantic similarity is a different issue, not covered in this paper.

3 Inverted Files and Block Addressing

An inverted index (or file) has two parts: vocabulary and occurrences [15, 5]. The vocabulary of the text is the list of distinct words that appear in it. The occurrences contain, for each vocabulary word, the list of the text positions where the word appears. A classical query is solved by searching the pattern in the vocabulary (using binary search or an auxiliary data structure) and retrieving the list of its occurrences (i.e. the positions where the pattern appears in the text). Search times are very good, but the best implementations of inverted indices pose a 20% to 30% space overhead over the text size.
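
The two-part structure of a full inverted index can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation studied in the paper: the function names (`tokenize`, `build_full_inverted_index`, `exact_query`), the regular-expression tokenizer, and the use of word ordinals as "text positions" are all hypothetical choices made for the example.

```python
import re
from bisect import bisect_left
from collections import defaultdict


def tokenize(text):
    """Yield (position, word) pairs; a position is the ordinal of the word (assumption)."""
    for pos, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        yield pos, match.group().lower()


def build_full_inverted_index(text):
    """Return (vocabulary, occurrences) for a full inverted index."""
    occurrences = defaultdict(list)
    for pos, word in tokenize(text):
        occurrences[word].append(pos)   # positions are appended in increasing order
    vocabulary = sorted(occurrences)    # sorted list of distinct words
    return vocabulary, dict(occurrences)


def exact_query(vocabulary, occurrences, word):
    """Binary-search the vocabulary, then return the word's list of positions."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[word]
    return []


text = "the index points to the text and the text holds the words"
vocabulary, occurrences = build_full_inverted_index(text)
print(exact_query(vocabulary, occurrences, "text"))
```

The vocabulary is kept as a sorted list so that the lookup can use binary search, as the text suggests; a hash table or a trie would serve equally well as the "auxiliary data structure".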

3.1 Block Addressing

Block addressing is a technique to reduce the space requirements of an inverted file. It was first proposed in a system called Glimpse (see Section 3.3). The idea is that the text is logically divided in blocks, and the occurrences do not point to exact word positions but only to the blocks where the word appears. Space is saved because there are fewer blocks than text positions (and hence the pointers are shorter), and also because all the occurrences of a given word in a single text block are referenced only once. Figure 1 illustrates a block addressing index with r blocks of size b (i.e. n = rb).

Figure 1: The word indexing scheme. The text is divided into r blocks of b words each; the index holds the words and their occurrences.

Searching in a block addressing index is similar to searching in a traditional one (which we call "full inverted index" in this paper). The pattern is searched in the vocabulary and a list of blocks where the pattern appears is retrieved. However, to obtain the exact pattern positions in the text, a sequential search over the qualifying blocks becomes necessary. The index is therefore used as a filter to avoid a sequential search over some blocks, while the others need to be checked. Hence, the reduction in space requirements is obtained at the expense of higher search costs.

At this point the reader may wonder what the advantage is of pointing to artificial blocks instead of pointing to documents (or files), this way following the natural divisions of the text collection. If we consider the case of simple queries (say, one word), where we are required to return only the list of matching documents, then pointing to documents is a very adequate choice. Moreover, as we see later, it may reduce space requirements with respect to using blocks of fixed size. Also, if we use blocks of fixed size and pack many short documents in a logical block, we will have to traverse the matching blocks (even for these simple queries) to determine which documents inside the block actually matched.

However, consider the case where we are required to deliver the exact positions which match a pattern. In this case we need to sequentially traverse the qualifying blocks or documents to find
the exact positions. Moreover, in some types of queries such as phrases or proximity queries, the index can only tell that two words appear in the same block, and we need to traverse it in order to determine if they form a phrase. In this case, pointing to documents of different sizes is not a good idea because larger documents are searched with higher probability and searching them costs more. In fact, the expected cost of the search is directly related to the variance in the size of the pointed documents. This suggests that if the documents have different sizes it may be a good idea to (logically) partition large documents into blocks and to put together small documents, such that blocks of the same size are used.

3.2 Searching Allowing Errors

An inverted file can be easily converted into an efficient word-retrieving index for approximate string matching. This idea, again, was first proposed for Glimpse (see Section 3.3). To search an approximate pattern in the text, we start by sequentially scanning the vocabulary, word by word, using an on-line algorithm. This allows us to collect the set of different words that match the query. Once these words are known, their positions in the text are retrieved and all the lists are merged into a single one, which is the final answer. This scheme works well because the vocabulary grows slowly as the text grows, and in large text collections it takes less than 1% of the text size. This well-known phenomenon in IR, called Heaps' law [19], is described in detail later in this paper.

When combined with block addressing, the result is a two-stage sequential search process. First, the vocabulary is sequentially searched and the list of qualifying blocks is obtained. Second, each such block is sequentially traversed to find the actual matches. The second step, as explained, may be absent if we point to documents, search for a single word, and want only the list of qualifying documents.

Notice that this idea can be used not only for approximate searching, but also to search for any extended pattern, as long as words in the pattern are matched to words in the text.
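
The two-stage process can be illustrated with a small Python sketch. It is a simplified model under several assumptions, not Glimpse's code nor the authors' implementation: the names `edit_distance`, `build_block_index` and `approximate_search` are hypothetical, blocks hold a fixed number of words b, the vocabulary scan uses the textbook dynamic-programming edit distance rather than a fast on-line algorithm, and the block traversal compares whole words against the set found in the first stage.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Minimum number of character insertions, deletions and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # replace, or match at no cost
        prev = curr
    return prev[-1]


def build_block_index(words, block_size):
    """Map each distinct word to the sorted list of blocks where it appears."""
    index = defaultdict(set)
    for pos, word in enumerate(words):
        index[word].add(pos // block_size)  # each block is referenced only once
    return {word: sorted(blocks) for word, blocks in index.items()}


def approximate_search(words, index, block_size, pattern, k):
    """Return the positions of all words within edit distance k of the pattern."""
    # Stage 1: sequential scan of the vocabulary.
    matching = {w for w in index if edit_distance(w, pattern) <= k}
    candidate_blocks = sorted({b for w in matching for b in index[w]})
    # Stage 2: sequential traversal of each qualifying block.
    positions = []
    for block in candidate_blocks:
        start = block * block_size
        for pos in range(start, min(start + block_size, len(words))):
            if words[pos] in matching:
                positions.append(pos)
    return positions


words = "the color of the survey and the colour of the surgery".split()
block_index = build_block_index(words, block_size=4)
print(approximate_search(words, block_index, 4, "color", 1))
```

With b = 4 the eleven words fall into three blocks. The query "color" with k = 1 matches the vocabulary words "color" and "colour", so only the first two blocks are traversed and the third is filtered out by the index.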