Retrieving OCR Text: A Survey of Current Approaches

Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Information Retrieval Laboratory
Department of Computer Science
Illinois Institute of Technology
Abstract

The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.

Introduction

As electronic media becomes more and more prevalent, the need for transferring older documents to the electronic domain grows. Optical Character Recognition (OCR) works by scanning source documents and performing character analysis on the resulting images, giving a translation to ASCII text, which can then be stored and manipulated electronically like any standard electronic document. Unfortunately, the character recognition process is not perfect, and errors often occur. These errors have an adverse effect on the effectiveness of information retrieval algorithms that are based on exact matches of query terms and document terms. Searching OCR data is essentially a search of noisy or error-filled text.

We briefly survey approaches to searching OCR text. These include: defining an IR model for OCR text collections, using an OCR-error aware processing component as part of a text categorization system, auto-correction of errors introduced by the OCR process, and improving string matching for noisy data. The following sections of this paper will examine each of these approaches, and give a summary of the progress made in each area.

IR models for OCR text

Some of the initial work on developing a model of Information Retrieval specifically suited for operating on a collection of OCR text was done by [Mitt95]. Their study involved the development of a term weighting scheme for the probabilistic model of information retrieval.
The motivations for this work arose from conclusions reached in [Tagh94, Tagh96] (and later explored in depth by Mittendorf and colleagues in [Mitt96]), which show that, although retrieval performance is not generally adversely affected by errors in an OCR text collection, performance degradation is often observed in cases where there are few documents in the collection, or the documents in question are very short. It was also observed that the presence of errors in OCR text tends to lead to unstable and unpredictable retrieval performance. In an effort to develop a retrieval model that circumvents these limitations, the authors of [Mitt95] incorporate the occurrence probabilities of several kinds of typical OCR errors into their term-weighting schemes. In a set of experiments on a collection of very short documents, they achieved a 23-30% improvement in retrieval effectiveness by using a form of the probabilistic model. Their general equation for the Retrieval Status Value (RSV) of a document is given in Equation 1 below:

$$RSV(q, d) = \frac{ff(\varepsilon, q) \cdot ff(\varepsilon, d)}{[\ldots]}$$

Where:
q = query
d = document
ff(ε, q) = feature frequency in query
ff(ε, d) = feature frequency in document
[...] = number of occurrences of feature frequency in document

Equation 1: Retrieval Status Value for OCR-IR

Further work has been done in adapting the probabilistic model of IR to handle error-ridden OCR text in [Hard97]. This study makes use of n-grams to overcome the problems introduced by OCR text, which is a fitting solution, considering that a large percentage of typical OCR errors involve misidentifying a sequence of one or two characters within a given word. Their basic approach defines a set of binding operators over the constituent n-grams of a word. The goal of these operators is to be strict enough to exclude noise and non-relevant information from the top documents, while lenient enough to prevent elimination of relevant data when a particular constituent n-gram is missing.
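
To illustrate why n-grams tolerate character-level OCR errors, the following is a minimal Python sketch. It is not the binding-operator machinery of [Hard97]; it only splits a term into overlapping character trigrams and measures how many of them survive in a damaged variant. The function names and the damaged spelling ("m" misread as "rn") are assumptions chosen for illustration.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of `word` (no padding)."""
    word = word.lower()
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(query_term, text_term, n=3):
    """Fraction of the query term's distinct n-grams that also occur in text_term."""
    query_grams = set(char_ngrams(query_term, n))
    text_grams = set(char_ngrams(text_term, n))
    return len(query_grams & text_grams) / len(query_grams)


clean = "environmental"
damaged = "environrnental"  # hypothetical OCR output: "m" read as "rn"

print(clean == damaged)                          # exact matching fails
print(char_ngrams(clean))                        # trigrams of the clean term
print(round(ngram_overlap(clean, damaged), 2))   # share of trigrams still matched
```

An exact comparison rejects the damaged term outright, while most of its trigrams still agree with the query term, which is the property that n-gram based retrieval relies on.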
The experimentation performed in the study was designed to discover which operators would maximize retrieval performance, and it was found that the passage5 operator, which ranks documents containing n-gram components within windows of five word positions while allowing the windows to cross word boundaries, was the most effective approach. It is theorized that this is to be expected because of the extremely common OCR error wherein spaces are added to the text in improper locations. An example of this can be seen with the word environmental, which is broken up into the n-gram components (en env env ronm onm ment tal al). The length of this word makes it a very likely candidate for incorrect identification during the OCR process via the insertion of one or several spaces; however, the passage5 operator binds these components together loosely over a large range that may cross word boundaries, helping to mitigate this problem. In their experiments, operating on four databases that were randomly degraded using data developed at UNLV [Rice93], an improvement ranging from 5-38% was observed, depending on the test collection. The authors used a similar approach to aid in query term expansion. They used n-grams to discover candidate expansion terms that were a match or a near-match to terms in the original query. As expected, a further improvement of 9-18% was observed when using n-grams for query expansion.

Processing OCR Text for Categorization

In addition to document retrieval, there are other areas of IR that must effectively handle OCR text. One such area is Text Categorization, wherein a group of documents are examined and assigned to a set of categories, typically to aid in document browsing or visualization. Some of the early work in this area is presented in [Hoch94].
This study does not deal directly with mitigating problems introduced by OCR text; rather, it describes an approach to the development of an automatic indexing and classification system that uses advanced morphological analysis of the text, along with term frequency analysis, index term weighting, and training of the classifier based on a previously defined document model [Deng92]. The results of this study indicate that classification performance was inhibited by the degradation of index terms from the OCR process. No solution to the problem is given, although it is reasonable to assume that the incorporation of a specialized term weighting scheme for OCR documents, such as the ones described above, would help to improve performance. Evidence for this assumption is presented in [Cavn94] and [Junk97], wherein advanced techniques such as n-gram processing and morphological analysis are used to aid in reducing the effect of imperfections introduced by the OCR process on retrieval effectiveness. A survey of common techniques used to enhance effectiveness in text categorization can be found in [Seba02].

Auto-Correction of OCR errors

One valuable area of research involves post-processing systems that work to correct the errors that are introduced in the OCR process. This has far-reaching implications, as being able to efficiently compensate for OCR errors allows conventional IR techniques to be used without experiencing degradation in effectiveness. Some early work in this area is discussed in [Liu91]. This study examines and classifies each type of error that can be introduced by the OCR process. Furthermore, it identifies which errors are the most typical and most likely to introduce confusion in the resulting documents. Several techniques for correcting the errors are discussed as well. Dictionary lookups on candidate terms are one of the most coarse-grained techniques used, and they help to identify words that have likely been corrupted. In addition, terms in the source text are broken up into digrams, and a frequency matrix is kept to help identify which character sequences are indicative of errors.
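
The combination of a dictionary lookup with digram frequencies can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the system described in [Liu91]: the tiny dictionary, the `min_count` threshold, and all function names are hypothetical, and a flat count table stands in for the frequency matrix.

```python
from collections import Counter


def digrams(word):
    """Return the adjacent character pairs of `word`."""
    word = word.lower()
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_digram_counts(reference_words):
    """Count how often each digram occurs in a collection of trusted words."""
    counts = Counter()
    for word in reference_words:
        counts.update(digrams(word))
    return counts


def suspicious_digrams(word, counts, min_count=1):
    """Digrams of `word` seen fewer than `min_count` times in the reference data."""
    return [g for g in digrams(word) if counts[g] < min_count]


def flag_word(word, dictionary, counts, min_count=1):
    """Return None for dictionary words, else the digrams that look like OCR damage."""
    if word.lower() in dictionary:
        return None
    return suspicious_digrams(word, counts, min_count)


# Hypothetical toy dictionary; a real system would use a full lexicon.
dictionary = {"retrieval", "information", "document", "character"}
counts = build_digram_counts(dictionary)

print(flag_word("document", dictionary, counts))   # known word, nothing flagged
print(flag_word("docurnent", dictionary, counts))  # unknown word, rare digrams listed
```

The dictionary check is the coarse filter; the digram counts then localize which character pairs inside a rejected word are the likely site of the recognition error.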
Based on the digram frequency matrix, adaptive character conversion maps are constructed that allow the system to identify a character sequence likely to be in error, and automatically correct it by performing a lookup in the map and replacing it with the corrected version. These automated techniques were performed in concert with user interaction, and resulted in a significantly improved final text that did not require nearly as much user intervention as prior corrective approaches.

Another study involving the use of a post-processing system for OCR error correction was performed in [Tagh94]. They used techniques similar to the ones mentioned above, with the addition of a clustering technique that grouped collections of misspellings with their correctly spelled target term. After frequency analysis on the term spelling clusters eliminated unlikely candidates, each misspelled term was replaced with its correctly spelled counterpart. A more complete overview of error correction techniques in post-processing systems can be found in [Kuk92].

Improved String Matching on Noisy Data

For applications where it is desirable to find all occurrences of a particular term, there is the notion of exact string matching. When the data is noisy or corrupted, as is the case with OCR text, exact string matching becomes difficult. This problem has been approached by training language models to recognize terms that are improperly spelled, as done in [Coll01]. A more detailed survey of string matching approaches can be found in [Nava01].

Summary

We have briefly surveyed current techniques in use to facilitate information retrieval on collections of OCR text. There are a large number of proposed techniques and models available for use, and hopefully in the future a generalized solution that takes on aspects of all available techniques will be available to members of the Information Retrieval community.

References

[Cavn94] W. Cavnar and J. Trenkle. N-Gram based text categorization. In Proc. of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-175, Las Vegas, NV.
[Coll01] Kevyn Collins-Thompson, Charles Schweizer, and Susan Dumais. Improved string matching under noisy channel conditions, Proceedings of the tenth international conference on Information and knowledge management.
[Deng92] A. Dengel, R. Bleisinger, R. Hoch, F. Fein, and F. Hones. From Paper to Office Document Standard Representation. IEEE Computer, vol. 25, no. 7, 1992.
[Fras01] Paolo Frasconi, Giovanni Soda, and Alessandro Vullo. Text categorization for multi-page documents: a hybrid naive Bayes HMM approach, Proceedings of the first ACM/IEEE-CS joint conference on Digital libraries.
[Hard97] Stephen M. Harding, W. Bruce Croft, and C. Weir. Probabilistic Retrieval of OCR Degraded Text Using N-Grams, European Conference on Digital Libraries (ECDL).
[Hoch94] Rainer Hoch. Using IR techniques for text classification in document analysis, Proceedings of the seventeenth annual international ACM-SIGIR conference on Research and development in information retrieval.
[Junk97] M. Junker and R. Hoch. Evaluating OCR and non-OCR text representations for learning document classifiers. In Proc. ICDAR 97.
[Kuk92] Karen Kukich. Techniques for automatically correcting words in text, ACM Computing Surveys (CSUR), 24:4.
[Liu91] Lon-Mu Liu, Yair M. Babad, Wei Sun, and Ki-Kan Chan. Adaptive postprocessing of OCR text via knowledge acquisition, Proceedings of the 19th annual conference on Computer Science.
[Mitt95] Elke Mittendorf, Peter Schäuble, and Páraic Sheridan. Applying probabilistic term weighting to OCR text in the case of a large alphabetic library catalogue, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval.
[Mitt96] E. Mittendorf and P. Schäuble. Measuring the effects of data corruption on information retrieval. In Proc. SDAIR'96.
[Nava01] Gonzalo Navarro. A guided tour to approximate string matching, ACM Computing Surveys (CSUR), 33:1.
[Rice93] S. Rice, J. Kanai, and T. Nartker. An Evaluation of Information Retrieval Accuracy. In UNLV Information Science Research Institute Annual Report (1993).
[Seba02] Fabrizio Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys (CSUR), 34:1.
[Tagh94] Kazem Taghva, Julie Borsack, and Allen Condit. Results of applying probabilistic IR to OCR text, Proceedings of the seventeenth annual international ACM-SIGIR conference on Research and development in information retrieval.
[Tagh96] Kazem Taghva, Julie Borsack, and Allen Condit. Evaluation of model-based retrieval effectiveness with OCR text, ACM Transactions on Information Systems (TOIS), 14:1.