A Survey of Retrieval Strategies for OCR Text Collections

Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology
Abstract

The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.

Introduction

As electronic media become more and more prevalent, the need for transferring older documents to the electronic domain grows. Optical Character Recognition (OCR) works by scanning source documents and performing character analysis on the resulting images, giving a translation to ASCII text, which can then be stored and manipulated electronically like any standard electronic document. Unfortunately, the character recognition process is not perfect, and errors often occur. These errors have an adverse effect on the effectiveness of information retrieval algorithms that are based on exact matches of query terms and document terms. Searching OCR data is essentially a search of noisy or error-filled text.

We briefly survey approaches to searching OCR text. These include: defining an IR model for OCR text collections, using an OCR-error aware processing component as part of a text categorization system, auto-correction of errors introduced by the OCR process, improving string matching for noisy data, and issues in cross-language retrieval of image data. We also note that an excellent survey of the more general area of indexing and retrieving document images can be found in [Doer98]. Other recent papers surveying current approaches in OCR Information Retrieval appeared in the 2002 SIGIR workshop on OCR Information Retrieval. The following sections of this paper will examine each of these approaches, and give a summary of the progress made in each area.

IR models for OCR text

Most work in the field of OCR Information Retrieval has been relatively recent.
A primary reason for this is that, especially in early years, obtaining a sufficient quantity of data on which to test was very difficult. To get around this problem, Croft and colleagues published one of the first studies of the effects of OCR data on IR by using simulated OCR collections [Crof94]. This study found that for high quality OCR conversion, not much degradation in retrieval effectiveness was encountered; however, the retrieval of short and low-quality documents was adversely affected. Lopresti and Zhou examined the performance of several models of IR on simulated OCR data, and were able to show that it was plausible to use modified approaches (they made use of fuzzy logic and approximate string matching) to increase effectiveness on noisy data [Lopr96]. These conclusions led to the development of models of IR designed specifically for operating on a body of OCR text.

Some of the initial work on developing a model of Information Retrieval specifically suited for operating on a collection of OCR text was done by [Mitt95]. Their study involved the development of a term weighting scheme for the probabilistic model of information retrieval. The motivations for this work arose from conclusions reached in [Crof94, Tagh94a, Tagh94b, Tagh96] (and later explored in depth by Mittendorf and colleagues in [Mitt96]), which show that, although retrieval performance is not generally adversely affected by errors in an OCR text collection, performance degradation is often observed in cases where there are few documents in the collection, or the documents in question are very short. It was also observed that the presence of errors in OCR text tends to lead to unstable and unpredictable retrieval performance. In an effort to develop a retrieval model that circumvents these limitations, they incorporate the occurrence probabilities of several kinds of typical OCR errors into their term-weighting schemes. In a set of experiments on a collection of very short documents, they achieved a 23-30% improvement in retrieval effectiveness by using a form of the probabilistic model.
Their general equation for the Retrieval Status Value (RSV) of a document is given in Equation 1 below:

\[ RSV(q, d) = \sum_{\varepsilon} \frac{ff(\varepsilon, q) \, ff(\varepsilon, d)}{\lambda(q, d)} \]

Where:
$q$ = query
$d$ = document
$ff(\varepsilon, q)$ = feature frequency in query
$ff(\varepsilon, d)$ = feature frequency in document
$\lambda(q, d)$ = number of occurrences of feature frequency in document

Equation 1: Retrieval Status Value for OCR-IR

The referenced work presents a number of methods for estimating various feature frequencies and the probabilities of their occurrence.

Further work has been done in adapting the probabilistic model of IR to handle error-ridden OCR text. Ohta enhanced the probabilistic model to take advantage of expected errors in OCR text [Ohta97]. Specific character transformations and character occurrence bigrams were used to generate candidate terms for each true search term. Documents retrieved by each candidate term are then evaluated for inclusion into the final result set. This resulted in minor improvements in recall for moderate quality OCR documents.

Another study was performed in [Hard97]. This study makes use of n-grams to overcome the problems introduced by OCR text, which is quite a fitting solution, considering that a large percentage of typical OCR errors involve misidentifying a sequence of one or two characters within a given word. Their basic approach defines a set of binding operators over the constituent n-grams of a word. The goal of these operators is to be strict enough to exclude noise and non-relevant information from the top documents, while lenient enough to prevent elimination of relevant data when a particular constituent n-gram is missing. The experimentation performed in the study was designed to discover which operators would maximize retrieval performance, and it was found that the passage5 operator, which ranks documents containing n-gram components within windows of five word positions while allowing the windows to cross word boundaries, was the most effective approach.
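
The idea of binding a word's n-grams loosely within a window of word positions can be illustrated with a short sketch. This is only a simplified illustration under stated assumptions, not the passage5 operator of [Hard97]: the function names, the fixed trigram size, and the scoring rule (fraction of the query term's n-grams found in a window) are all hypothetical choices made for this example.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole (an assumption of this sketch).
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def window_score(term: str, tokens: List[str], start: int,
                 window: int = 5, n: int = 3) -> float:
    """Fraction of the term's n-grams found in tokens[start:start + window].

    The tokens in the window are concatenated before matching, so an n-gram
    may straddle a space that OCR inserted in the middle of a word.
    """
    grams = char_ngrams(term, n)
    text = "".join(t.lower() for t in tokens[start:start + window])
    hits = sum(1 for g in grams if g in text)
    return hits / len(grams)


def best_window_score(term: str, tokens: List[str],
                      window: int = 5, n: int = 3) -> float:
    """Best window score for the term over every window start position."""
    if not tokens:
        return 0.0
    return max(window_score(term, tokens, s, window, n)
               for s in range(len(tokens)))


# OCR has split "environmental" with a spurious space.
ocr_tokens = "the envir onmental impact was large".split()
print(best_window_score("environmental", ocr_tokens))
```

An exact token match would miss the split word in this example, while the windowed n-gram score still finds every trigram. Concatenating tokens can also create spurious matches across genuine word boundaries, which is one reason a real operator needs to balance strictness against leniency.
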
It is theorized that the success of the passage5 operator is to be expected because of the extremely common OCR error wherein spaces are added to the text in improper locations. An example of this can be seen with the word environmental, which is broken up into the n-gram components (en env envi ronm onm ment tal al). The length of this word makes it a very likely candidate for incorrect identification during the OCR process via the insertion of one or several spaces; however, the passage5 operator binds these components together loosely over a large range that may cross word boundaries, helping to mitigate this problem. In their experiments, operating on four databases that were randomly degraded using data developed at UNLV [Rice93], an improvement ranging from 5-38% was observed, depending on the test collection. The authors used a similar approach to aid in query term expansion. They used n-grams to discover candidate expansion terms that were a match or a near-match to terms in the original query. As expected, a further improvement of 9-18% was observed when using n-grams for query expansion.

Very recently, work has been done that takes common OCR errors into account when generating a language model [Jin02]. This language model can then be used to approximate an uncorrupted version of a particular document, and it can be used for retrieval in a language modeling approach. The authors found significant improvement when using this approach over other approaches that explicitly correct each error found in the source documents.

Processing OCR Text for Categorization

In addition to document retrieval, there are other areas of IR that must effectively handle OCR text. One such area is Text Categorization, wherein a group of documents are examined and assigned to a set of categories, typically to aid in document browsing or visualization. Some of the early work in this area is presented in [Hoch94].
This study does not deal directly with mitigating problems introduced by OCR text; rather, it describes an approach to the development of an automatic indexing and classification system that uses advanced morphological analysis of the text, along with term frequency analysis, index term weighting, and training of the classifier based on a previously defined document model [Deng92]. The results of this study indicate that classification performance was inhibited by the degradation of index terms from the OCR process. No solution to the problem is given, although it is reasonable to assume that the incorporation of a specialized term weighting scheme for OCR documents, such as the ones described above, would help to improve performance. Evidence for this assumption is presented in [Cavn94] and [Junk97], wherein advanced techniques such as n-gram processing and morphological analysis are used to aid in reducing the effect of imperfections introduced by the OCR process on retrieval effectiveness. A survey of common techniques used to enhance effectiveness in text categorization can be found in [Seba02].

Auto-Correction of OCR errors

One valuable area of research involves post-processing systems that work to correct the errors that are introduced in the OCR process. This has far-reaching implications, as being able to efficiently compensate for OCR errors allows conventional IR techniques to be used without experiencing degradation in effectiveness. Some early work in this area is discussed in [Liu91]. This study examines and classifies each type of error that can be introduced by the OCR process. Furthermore, it identifies which errors are the most typical and most likely to introduce confusion in the resulting documents. Several techniques for correcting the errors are discussed as well. Dictionary lookups on candidate terms are one of the most coarse-grained techniques used, and they help to identify words that have likely been corrupted. In addition, terms in the source text are broken up into digrams, and a frequency matrix is kept to help identify which character sequences are indicative of errors.
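
The combination of a dictionary lookup with digram frequency counts can be sketched as follows. This is a minimal illustration under stated assumptions, not the system of [Liu91]: the function names, the tiny lexicon, and the `min_count` threshold are hypothetical, and a plain counter stands in for the frequency matrix.

```python
from collections import Counter
from typing import Iterable, List, Set


def digrams(word: str) -> List[str]:
    """Return the overlapping two-character sequences of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_digram_counts(clean_words: Iterable[str]) -> Counter:
    """Count digram occurrences over a list of trusted, correctly spelled words."""
    counts: Counter = Counter()
    for word in clean_words:
        counts.update(digrams(word.lower()))
    return counts


def suspicious_tokens(tokens: Iterable[str], lexicon: Set[str],
                      counts: Counter, min_count: int = 1) -> List[str]:
    """Flag tokens that fail the dictionary lookup AND contain a rare digram.

    A digram is treated as rare when it was seen fewer than min_count times
    in the trusted word list (a hypothetical rule chosen for this sketch).
    """
    flagged = []
    for token in tokens:
        lowered = token.lower()
        if lowered in lexicon:
            continue  # dictionary lookup succeeded, assume the word is intact
        if any(counts[d] < min_count for d in digrams(lowered)):
            flagged.append(token)
    return flagged


lexicon = {"the", "modern", "retrieval", "system"}
counts = build_digram_counts(lexicon)
ocr_tokens = ["the", "rnodern", "retrieval", "systern"]
print(suspicious_tokens(ocr_tokens, lexicon, counts))  # ['rnodern']
```

In this toy run "systern" is not flagged, because every one of its digrams also occurs in the trusted words. This shows why digram statistics alone are a coarse signal and are usually combined with other evidence.
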
Based on these digram frequencies, adaptive character conversion maps are constructed that allow the system to identify a character sequence likely to be in error, and automatically correct it by performing a lookup in the map and replacing it with the corrected version. These automated techniques were performed in concert with user interaction, and resulted in a significantly improved final text that did not require nearly as much user intervention as prior corrective approaches.

Another study involving the use of a post-processing system for OCR error correction was performed in [Tagh94a, Tagh94b]. They used techniques similar to the ones mentioned above, with the addition of a clustering technique that grouped collections of misspellings in with their correctly spelled target term. After frequency analysis on the term spelling clusters eliminated unlikely candidates, each misspelled term was replaced with its correctly spelled counterpart. A more complete overview of error correction techniques in post-processing systems can be found in [Kuki92].

Improved String Matching on Noisy Data

For applications where it is desirable to find all occurrences of a particular term, there is the notion of exact string matching. When the data is noisy or corrupted, as is the case with OCR text, exact string matching becomes difficult. This problem has been approached by training language models to recognize terms that are improperly spelled, as done in [Coll01]. A more detailed survey of string matching approaches can be found in [Nava01].

OCR Issues in Cross-Language Retrieval

Performing Information Retrieval on OCR documents in non-English languages provides some interesting and unique challenges. Oard and colleagues have implemented a full system for cross-language retrieval, and have extended that system to support text from document images [Oard99]. Basically, this system proposes the use of character-confusion statistics and character-class recognition algorithms that are specific to the target language in order to mitigate the ambiguities and errors introduced into a text by the OCR process.
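
One simple way to picture how character-confusion statistics can inform matching on noisy text is an edit distance whose substitution cost is lowered for character pairs that OCR commonly confuses. The sketch below is a generic illustration and not the method of [Oard99] or [Coll01]; the function name, the cost table, and the cost values are assumptions made for this example, and in practice such costs would be estimated from language-specific confusion data.

```python
from typing import Dict, Optional, Tuple


def weighted_edit_distance(
    a: str,
    b: str,
    confusion_cost: Optional[Dict[Tuple[str, str], float]] = None,
) -> float:
    """Edit distance from a to b with cheaper substitutions for confusable pairs.

    confusion_cost maps (char_in_a, char_in_b) to a substitution cost.
    Pairs missing from the table cost 1.0; insertions and deletions cost 1.0.
    """
    costs = confusion_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else costs.get((ca, cb), 1.0)
            curr.append(min(prev[j] + 1.0,       # delete ca
                            curr[j - 1] + 1.0,   # insert cb
                            prev[j - 1] + sub))  # substitute ca with cb
        prev = curr
    return prev[-1]


# Hypothetical confusion table: a true "l" is often recognized as "1".
costs = {("l", "1"): 0.25}
print(weighted_edit_distance("retrieval", "retrieva1", costs))  # 0.25
print(weighted_edit_distance("retrieval", "retrieva1"))         # 1.0
```

The table is directional by design, since the probability of reading "l" as "1" need not equal the reverse. Multi-character confusions such as "rn" read as "m" are outside what this single-character sketch can express.
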
More recently, Darwish and Oard have focused on retrieval of OCR'd Arabic text, and have found that using a combination of light-stemming approaches and character n-grams for the selection of candidate index terms generally provides the most improvement in retrieval effectiveness, and is robust over a variety of OCR errors [Darw01, Darw02].

Summary

We have briefly surveyed current techniques in use to facilitate information retrieval on collections of OCR text. There are a large number of proposed techniques and models available for use, and hopefully in the future a generalized solution that takes on aspects of all available techniques will be available to members of the Information Retrieval community.

References

[Cavn94] W. Cavnar and J. Trenkle. N-Gram based text categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, 1994.
[Crof94] W. B. Croft, S. M. Harding, K. Taghva, and J. Borsack. An Evaluation of Information Retrieval Accuracy with Simulated OCR Output. Proceedings of the Symposium on Document Analysis and Information Retrieval, 1994.
[Coll01] Kevyn Collins-Thompson, Charles Schweizer, and Susan Dumais. Improved string matching under noisy channel conditions. Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM), 2001.
[Darw01] Kareem Darwish, D. Doermann, R. Jones, D. Oard, and M. Rautiainen. TREC-10 Experiments at Maryland: CLIR and Video. TREC-2001, 2001.
[Darw02] Kareem Darwish and Douglas W. Oard. Term Selection for Searching Printed Arabic. Proceedings of the Twenty-Fifth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 2002.
[Deng92] A. Dengel, R. Bleisinger, R. Hoch, F. Fein, and F. Hones. From Paper to Office Document Standard Representation. IEEE Computer, vol. 25, no. 7, 1992.
[Doer98] David Doermann. The Indexing and Retrieval of Document Images: A Survey. The Journal of Computer Vision and Image Understanding: CVIU, Volume 70, Number 3, 1998.
[Fras01] Paolo Frasconi, Giovanni Soda, and Alessandro Vullo. Categorization for multi-page documents: A Hybrid Naive Bayes HMM Approach. Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001.
[Hard97] Stephen M. Harding, W. Bruce Croft, and C. Weir. Probabilistic Retrieval of OCR Degraded Text Using N-Grams. Proceedings of the European Conference on Digital Libraries (ECDL), 1997.
[Hoch94] Rainer Hoch. Using IR techniques for text classification in document analysis. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 1994.
[Junk97] M. Junker and R. Hoch. Evaluating OCR and non-OCR text representations for learning document classifiers. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 1997.
[Kuki92] Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, Vol. 24, No. 4, 1992.
[Liu91] Lon-Mu Liu, Yair M. Babad, Wei Sun, and Ki-Kan Chan. Adaptive post-processing of OCR text via knowledge acquisition. Proceedings of the 19th Annual Conference on Computer Science, 1991.
[Lopr96] D. Lopresti and J. Zhou. Retrieval Strategies for Noisy Text. Proceedings of the Symposium on Document Analysis and Information Retrieval, 1996.
[Mitt95] Elke Mittendorf, Peter Schäuble, and Páraic Sheridan. Applying probabilistic term weighting to OCR text in the case of a large alphabetic library catalogue. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
[Mitt96] E. Mittendorf and P. Schäuble. Measuring the effects of data corruption on information retrieval. Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR), 1996.
[Nava01] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, vol. 33, no. 1, 2001.
[Oard99] Douglas W. Oard. Issues in Cross-Language Retrieval from Document Image Collections. In Proceedings of the 1999 Symposium on Document Image and Understanding Technology, 1999.
[Ohta97] M. Ohta, A. Takasu, and J. Adachi. Retrieval Methods for English text with misrecognized OCR characters. Proceedings of the International Conference on Document Analysis and Recognition, 1997.
[Rice93] S. Rice, J. Kanai, and T. Nartker. An Evaluation of Information Retrieval Accuracy. In UNLV Information Science Research Institute Annual Report (1993), 9-20.
[Seba02] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34:1, 2002.
[Tagh94a] Kazem Taghva, Julie Borsack, and Allen Condit. Results of applying probabilistic IR to OCR text. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 1994.
[Tagh94b] Kazem Taghva, Julie Borsack, and Allen Condit. An Expert System for Automatically Correcting OCR Output. In Proceedings of the SPIE Document Recognition, 1994.
[Tagh96] Kazem Taghva, Julie Borsack, and Allen Condit. Evaluation of model-based retrieval effectiveness with OCR text. ACM Transactions on Information Systems (TOIS), 14:1, 1996.