Results of Applying Probabilistic IR to OCR Text

Kazem Taghva, Julie Borsack, Allen Condit
Information Science Research Institute
University of Nevada, Las Vegas
Las Vegas, NV 89154 USA

Abstract

Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In the broader sense, another fundamental measure of an OCR's goodness is whether its generated text is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to their manually corrected equivalent. We show there is no statistical difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.

1 Introduction

Anyone who has performed research in either optical character recognition (OCR) or information retrieval (IR) will attest that each process is complex in its own right. To presuppose what will occur when the two systems are combined would be injudicious at best. The studies we describe, in both this paper and our previous work [1, 2], show the actual effects OCR-generated data has on the IR systems used in our experiments.

In our first study [1], we ran a set of queries against two databases. Each database contained the same document collection, but one database had been manually corrected to a level of 99.8% correctness while the other database had been automatically generated by an OCR device with no manual correction. In this original set of experiments, we used a boolean exact-match system for our IR environment.
We believe using different IR systems for this kind of experimentation can influence the results. Since no relevancy information had been found for this collection, our evaluation consisted only of a comparison between the retrieved query sets.

The purpose of our current experiment is to broaden our preliminary testing. The collection we use is larger and, rather than using a single OCR device, we use three devices at graded accuracy levels. Our current experiments incorporate a form of the probabilistic IR model and apply a natural language interface [3]. Furthermore, we have collected relevancy information for our larger collection so that precision and recall can be used for evaluation.

Expanding our testing in the ways described is one reason for repeating the retrieval experiment. Another reason is the correctness of the documents we use in our tests. Our assumption in our original accuracy study was that the quoted level of correctness was accurate. To make our study more thorough, we manually verified the corrected text against each hard copy page for all the pages in our expanded collection. This comparison gave us insight into what can be expected from manually corrected document sets. We report on our findings in Section 2.1.

2 Document Collection

Our collection consists of 674 documents (26,467 pages) which are part of a larger collection that was given to the Information Science Research Institute (ISRI) by the Department of Energy (DOE) for continued study in the areas of optical character recognition and information retrieval. The full collection consists of approximately 2,600 documents (104,000 pages) together with their corrected ASCII and page images. These documents make up the Licensing Support System (LSS) prototype. The LSS will capture and track documents that pertain to the site licensing proceedings of the Nuclear Regulatory Commission.
As would be expected from this kind of collection, most of the documents deal with technical, scientific topics. The set we use is full of maps, charts, formulas, and comparable graphical material. Although the documents tend toward the scientific, within this domain the collection is diverse. Our set includes topics from rock mining to safety issues for the transportation of nuclear waste. The set we use covers all sixteen established subject areas designated for the LSS.

The characteristics of this document collection that make it particularly useful for demonstrating the ramifications of using OCR data are its lack of uniformity in page format, its diverse set of font styles, and its varying grades of hard copy quality. There are almost as many different authoring sources as there are documents. So although the collection is mostly scientific, the broad range of source input gives us a rich collection for the kind of testing we do.

This collection consists of full-text documents. The average document length is forty pages and the median length is sixteen pages. For more information about the complete document collection, see [4].

All of the 674 documents were recognized by three different OCR devices to generate the OCR text for our experiment. Our selection criterion for these devices was their level of character accuracy. The three devices we chose reflected the highest accuracy (98.14%), an intermediate accuracy (97.06%), and the lowest character accuracy ( ) of the devices we test at ISRI [5]. For clarity, we use the following designations for the three devices respectively: best, middle, and worst. These designations refer to character accuracy only; the qualities of other device features were not considered. The fourth database, designated as correct, consists of the 99.8% corrected ASCII for these 674 documents. Our hypothesis was that at some unknown accuracy level, retrieval rates would become unacceptable.
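The device selection above rests on character accuracy. The paper does not spell out how that figure is computed, so the following is only a minimal sketch under one common assumption: accuracy is the share of ground-truth characters that need no insertion, deletion, or substitution to turn the OCR output into the ground truth. The function names are hypothetical.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(previous[j] + 1,          # drop a truth character
                               current[j - 1] + 1,       # add an OCR character
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n the ground-truth length."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)


# A single substituted character in a four-character string.
print(character_accuracy("abcd", "abxd"))
```

Under this assumed definition, an accuracy near 98% still leaves roughly one erroneous character in every fifty, which is enough to damage a noticeable share of the words on a page.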
The purpose of our experiments is to see what effects the OCR data will have on the IR systems we use in our testing. To do this kind of evaluation, we must first know the characteristics of not only the OCR-generated text, but also the corrected text to which we compare it. The analysis of our database sets became a considerable part of our experimentation. There are two parts to our analysis: ASCII/OCR document verification and statistical examination of the IR system collections.

2.1 Document Verification

The correspondence of the ASCII document to its original hard copy version, and therefore to its image, is essential to our testing. We found from our first set of experiments that this correspondence may not always be what we expect. So for every document in our text database, we compared each hard copy page, its image page, and its ASCII page. Correspondence between a hard copy and an image page is simple to detect since an image is an electronic duplicate of its hard copy. Correspondence between a hard copy and its ASCII equivalent is more difficult to judge. Certain qualities and components of a document cannot be duplicated in its ASCII version, such as enlarged print or a photograph. But at some point, the rules for inclusion or exclusion of document elements must be made. These rules were in place for the corrected ASCII generated for the LSS. Unfortunately, with four different contractors these rules were left up to interpretation. Further, whenever human discretion is incorporated into projects of this magnitude, the results are difficult to predict.

We found some unexpected differences when we did our comparison. The most significant difference affecting our experimentation was the exclusion of text from the ASCII files. In some cases, this exclusion was purposeful while in others it was due to negligence. For example, one of the editing rules for the LSS was to replace tables and graphics with a "see image" tag.
In a number of cases, these elements contained a good portion of text. Other discrepancies include the addition of text not part of the original document, erroneous columnization, incorrect page ordering, and deletion of main body text. In a few cases, documents had to be excluded from our testing due to their lack of similarity to the original hard copy version. Although in most cases the corrected ASCII is quite good, the differences we found show that whenever human intervention is introduced, even for correction, some errors and inconsistencies should be expected. For a complete report on our verification procedure and our findings, see [6]. We made no changes to any of the ASCII files in the 99.8% corrected set. We assume that problems similar to these are inherent in any manual correction procedure and therefore are an integral part of our experimentation.

ASCII verification establishes the usefulness of the corrected document collection. We should also examine the qualities of our OCR collections to understand how they affect our results. OCR technology is a complex process whose results are difficult to predict, especially when its input is as multifarious as the collection we use [7, 8]. Further, the algorithms designed for recognition devices are proprietary, making an educated conjecture difficult.

To determine the characteristics of the OCR data, each image was compared to its hard copy and the corresponding generated text was inspected. In general, we looked for gross differences between the image and the generated text. For example, if an OCR-generated page had blocks of meaningless characters, it could indicate possible problems with the image. These irregularities were checked for all the OCR-generated pages used in our collection.

The most notable difference between the recognized document sets and the 99.8% correct version was the size of the data collections (prior to loading).
The largest OCR-generated collection was larger (in megabytes) than the corrected set. We attribute this difference to the exclusion of text in the corrected collection and the interjection of what we call graphic text, or strings introduced through mistranslation of graphical material.

page images    best      middle    worst
26,467         26,445    26,409    26,406

Table 1. Number of recognized pages for the three OCR collections

statistic                               correct        best           middle         worst
database size (bytes)                   15,686,772     37,080,489     40,918,148     42,247,537
average number of terms per document    [illegible]    9,321          8,583          7,925
unique indexed terms                    [illegible]    [illegible]    [illegible]    [illegible]
terms occurring only once               36,742         [illegible]    [illegible]    [illegible]
terms occurring more than once          41,752         [illegible]    [illegible]    [illegible]
correctly spelled words                 [illegible]    25,241         24,728         23,552

Table 2. Statistics for correct, best, middle, and worst collections

A variance found among the OCR collections was the pages that could be recognized by each of them. With a total of 26,467 page images, the recognized pages of one device were not necessarily a subset of the others. The number of pages recognized by each device can be found in Table 1.

Another difference among the devices is their ability to recognize and zone out graphics [7]. If an OCR device has an automatic zoning capability, it should differentiate between text and non-text sections (i.e., photographs, maps, etc.); it should translate only those that have been designated as text zones. We found through our testing that this proficiency can be consequential for some IR systems. The lack of accurate zoning not only adds to overhead, it may also limit IR effectiveness. These repercussions will be discussed in more detail later in this paper.

All the devices produced good quality output from good quality page images. Divergence in output accuracy became more apparent among the three devices as the quality of the images dropped.
2.2 Collection Statistics

In our original experiment with a boolean exact-match system, certain statistics about our collections were interesting but not necessarily a determinant in our results. This is not true for the model we use in this set of tests. In Table 2 we present statistics about each of the four collections. We believe these statistics had an impact on our results.

First, note the difference in the database size of the correct set versus any of the OCR collections. All three OCR collections are more than twice the size of the correct set. We attribute a good portion of this inflated size to graphic text, but every misspelled word in the OCR collections also carries a certain amount of overhead, which contributes to the size.

The average number of terms per document is another measure of increased magnitude. This number illustrates the average length of the documents in words; it is the total of all occurrences excluding stopwords. Besides correctly spelled words, this is the only decreasing statistic for the best, middle, and worst collections respectively. We can explain this downtrend by remembering that the best device recognizes more words correctly. So even though the total number of terms occurring more than once in the best database is lower than in the other OCR-generated databases, the total number of occurrences of these terms is higher. Also, it seems the best database had more duplications of single and double character text strings, which were probably generated by graphics.

The number of unique indexed terms is a good indicator of the amplified size of the OCR databases. There is a huge discrepancy between the corrected set and the three OCR collections. But even among the OCR databases, there is noticeable variability. Of the OCR collections, the best database has the fewest unique entries; as the accuracy of the device drops, the number of unique index terms increases.
These unique term counts demonstrate the amount of overhead required for OCR-generated text: the IR system may have to store 5 times as many terms. Further, these terms will probably be of little use to the user since many are misspellings and graphic text strings that will never be query terms.

In the correct database, the number of terms occurring only once is about 47% of the total number of indexed terms. This percentage is comparable to the statistics found for the collections used in TREC-1 [9]. The percentages of terms occurring only once in the OCR collections are disproportionate when compared to the corrected set's percentage. The percentages for these three collections are quite close to each other at above [illegible]. Terms occurring more than once are simply the complement of those terms occurring once. Of course, the fraction of these to the total number of indexed terms is much smaller than what we found in our corrected set.

Another interesting anomaly is the increase in the number of correctly spelled words found in the OCR collections compared to the corrected collection. This peculiarity shows that information contained in the documents is actually being removed from the manually corrected collection. Also, if we compare the last two rows of Table 2, there seems to be a considerable gap between the number of words occurring more than once and the number of correctly spelled words. This difference can be attributed to more than one factor. First, there are certainly legitimate words that were missed because our dictionary is incomplete: proper names, acronyms, and the like. Another factor is that misspellings can be duplicated. In Section 3.2, we show this is particularly true with the OCR collections. We use the intersection of the set of correctly spelled words and the set of words occurring more than once as the starting point for our automatic post-processing system described in Section 3.4.
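The statistics discussed here (unique indexed terms, terms occurring only once, and the intersection of correctly spelled and repeated words) can be sketched in a few lines. This is an illustration only: the whitespace tokenizer, the function name `term_statistics`, and the tiny sample inputs are assumptions, not the indexing actually used in the experiments. The last lines check the roughly 47% figure using the two counts reported for the correct collection in Table 2.

```python
from collections import Counter


def term_statistics(documents, stopwords, dictionary):
    """Count indexed terms and split them by how often they occur."""
    counts = Counter(
        token
        for text in documents
        for token in text.lower().split()   # naive tokenizer (assumption)
        if token not in stopwords
    )
    once = {term for term, n in counts.items() if n == 1}
    repeated = {term for term, n in counts.items() if n > 1}
    return {
        "unique_terms": len(counts),
        "once": len(once),
        "repeated": len(repeated),
        "fraction_once": len(once) / len(counts) if counts else 0.0,
        # Correctly spelled words that also occur more than once.
        "seed_words": repeated & set(dictionary),
    }


# Hypothetical two-document sample containing one OCR misspelling ("fhe").
sample = ["nevada ore stress nevada", "ore fhe doe"]
print(term_statistics(sample, stopwords={"the"},
                      dictionary={"nevada", "ore", "stress", "doe"}))

# Counts reported for the correct collection in Table 2.
correct_once, correct_repeated = 36_742, 41_752
print(round(correct_once / (correct_once + correct_repeated), 2))  # 0.47
```

In the sample, the misspelling "fhe" lands among the terms occurring once and never reaches the seed set, which mirrors why the intersection makes a reasonable starting vocabulary.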
Collectively, these statistics show OCR text has distinct characteristics. For IR to work advantageously with OCR, compensation for these characteristics should be made.

3 Experimental Environment

The document collection we use has considerable bearing on our experiments. The scientific content and diversity of the set make it a challenge for our experimentation. Other aspects of our testing environment also influence the eventual outcome. The scanner, the recognition devices, the retrieval system, and the queries all affect the results. Following is a brief description of each of these components and how they might affect or be affected by the OCR-generated text.

3.1 Scanning and Recognition

Scanning refers to the digitization of an input page into an electronic bit-mapped image [10]. The most commonly used scanners for text documents are those which binarize the input page. Binarization is the conversion of grey levels to either black or white. The binarization is controlled by a given threshold value. Another setting that can be controlled at scan time is the resolution. These settings will affect the ability of the OCR device to recognize the page [11]. For our experiments we used the most commonly applied values for the purpose of character recognition. The images we generated were scanned on the Fujitsu Image Scanner, model number M3096G, at 300 dpi resolution with a median threshold of 127.

We are uncertain about the scanning procedures applied for those images generated by the government contractors for the LSS. We do know these images were scanned with either a Ricoh or Fujitsu scanner at 300 dpi [4]. Wherever possible, we used the images generated by the contractors of the LSS. But for a good portion of the documents, the images were either missing or corrupt. For these documents, we scanned the hard copy pages using the settings described to complete our image set. Each of these page images was automatically zoned and recognized by the three OCR devices.
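Binarization with a fixed threshold, as described for the scanning step, can be illustrated with a small sketch. It assumes 8-bit grey levels where 0 is darkest and 255 is lightest, and that levels at or below the threshold become black; the paper does not state which side of the threshold maps to black, and the function name `binarize` is hypothetical.

```python
def binarize(grey_page, threshold=127):
    """Convert grey levels (assumed 0 = darkest, 255 = lightest) to black or white.

    Returns 1 for black and 0 for white. Treating levels at or below the
    threshold as black is an assumption made for illustration.
    """
    return [[1 if level <= threshold else 0 for level in row] for row in grey_page]


# One row of four pixels straddling the median threshold of 127.
print(binarize([[0, 127, 128, 255]]))
```

Lowering the threshold turns borderline grey pixels white, which thins faint strokes; raising it thickens strokes and can merge neighboring characters. Either shift can change what a recognition device sees on poor-quality hard copy.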
Our environment at ISRI provides a vendor-independent interface that controls all OCR settings and selections. Thus, all parameters remained constant for all the devices throughout our testing. For a full description of this interface, refer to [12].

As pointed out in our first paper [1], reduced accuracy should be expected with automatic zoning. But with such a large number of pages and the inability to duplicate the zoning rules prescribed by the LSS, manual zoning would not have been feasible. Automatic zoning influenced other aspects of our experiment. We mentioned that graphic text inflates the size of the database index. This is only one of its side effects. If there is enough noise in a document, it can affect an IR system's ability to determine meaningful document frequencies. Its effect on the IR system we use is described in Section 3.2.

3.2 INQUERY Retrieval System

The INQUERY Retrieval System applies the foundations of the probabilistic IR model and the notion of inference networks to determine document to query relevance. In general, the probabilistic approach tries to determine this relevance from a priori information about term distributions. Inference nets are specific probabilistic representations that attempt to model document and query content. The belief that documents are relevant to a given query is based on the concepts that have been inferred by the system. The method of assigning concepts to documents and to queries is the fundamental component of INQUERY's architecture. In their simplest form, concepts correspond to terms in the document or query [3].

[Table 3. Relationship between maxtf_j and document j ranking. Columns: document; maxtf_j and rank for each of the correct, best, middle, and worst collections. Values illegible.]

document       correct       best          middle        worst
[illegible]    image         hsf           hsf           m
[illegible]    stress        stress        stress        stress
[illegible]    veins         n             n             veins
[illegible]    ore           ore           ore           fhe
[illegible]    nevada        m             nevada        nevada
[illegible]    population    population    population    population
[illegible]    test          e             ihe           the
[illegible]    doe           doe           doe           doe

Table 4. Most frequent terms and disparity in their maxtf_j (the maxtf_j value beside each term is illegible)

Whether a concept is associated with a specific document depends on both document and collection term frequencies. Formula 1 is a simplification of the formula used by INQUERY to assign concept i to document j [13].

\[ k + (1 - k) \cdot \left[ \frac{\log tf_{ij}}{\log maxtf_j} \cdot \frac{\log (c / f_i)}{\log c} \right] \qquad (1) \]

where

c = collection size
f_i = number of documents in which concept i occurs
tf_ij = frequency of concept i in document j
maxtf_j = maximum concept frequency in document j

The parameter k serves to bias the initial belief of term relevance and can be adjusted by the user to fit the collection. We use the default belief value
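To make the role of maxtf_j concrete, here is a minimal sketch that reads Formula 1 as k + (1 - k) * [(log tf_ij / log maxtf_j) * (log(c / f_i) / log c)]. That reading is a reconstruction from the symbol definitions, not a confirmed transcription, and the function name `belief`, the value of k, and every number except c = 674 (the collection size reported in Section 2) are hypothetical.

```python
import math


def belief(tf_ij: int, maxtf_j: int, f_i: int, c: int, k: float) -> float:
    """Belief that concept i describes document j, under the assumed reading of Formula 1."""
    if maxtf_j < 2 or c < 2:
        raise ValueError("maxtf_j and c must be at least 2 so the logarithms are non-zero")
    if not 1 <= tf_ij <= maxtf_j:
        raise ValueError("tf_ij must lie between 1 and maxtf_j")
    if not 1 <= f_i <= c:
        raise ValueError("f_i must lie between 1 and c")
    term_part = math.log(tf_ij) / math.log(maxtf_j)     # weight within the document
    collection_part = math.log(c / f_i) / math.log(c)   # rarity across the collection
    return k + (1 - k) * term_part * collection_part


# Hypothetical numbers; only c = 674 comes from the paper.
K = 0.5  # arbitrary illustrative value, not INQUERY's default
clean = belief(tf_ij=20, maxtf_j=50, f_i=10, c=674, k=K)
noisy = belief(tf_ij=20, maxtf_j=5000, f_i=10, c=674, k=K)
print(f"clean maxtf_j: {clean:.3f}  inflated maxtf_j: {noisy:.3f}")
```

With tf_ij held fixed, a larger maxtf_j shrinks the within-document weight. This is one way a frequently repeated graphic-text string, such as the short strings listed in Table 4, could lower the belief assigned to every genuine term in the same document.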