
A New Approach to Annotate the Texts of Websites and Documents with a Quite Comprehensive Knowledge Base

Mohammad Yasrebi, Mehran Mohsenzadeh, and Mashalla Abbasi-Dezfuli
World Academy of Science, Engineering and Technology 45, 2008

Abstract — Machine-understandable data, when strongly interlinked, constitutes the basis for the Semantic Web. Annotating web documents is one of the major techniques for creating metadata on the Web. Annotating websites defines the contained data in a form suitable for interpretation by machines. In this paper, we present a new approach to annotate websites and documents by raising the abstraction level of the annotation process to a conceptual level. By this means, we hope to solve some of the problems of current annotation solutions.

Keywords — Knowledge base, ontology, semantic annotation, semantic web.

I. INTRODUCTION

Semantic annotation is the process of inserting tags in a document to assign semantics to text fragments, making the documents processable not only by humans but also by automated agents [8]. The acquisition of masses of metadata for web content would allow various Semantic Web applications to emerge and gain wide acceptance. At present there are various Information Extraction (IE) technologies available that allow recognition of named entities within text, and even of the relations, events, and scenarios in which they take part. Thus, metadata presenting part of a document's information content could be assigned to it, suitable for further processing. Such metadata can range from a formal reference to the author of the document to annotations of all the companies and amounts of money referred to in the text [13]. The automatic (versus manual) extraction of metadata promises scalable, cheap, author-independent, and potentially user-specific enrichment of web content. However, at present no technology provides automatic semantic annotation in a conceptually clear, intuitive, scalable, and sufficiently accurate fashion. All existing semantic annotation systems rely on human intervention at some point in the annotation process; the annotation process is therefore manual or semi-automatic.
In this paper, we present a new approach to the semantic enrichment (annotation) of websites and documents by taking the annotation process to a conceptual level and by integrating it with an existing knowledge base, WordNet. The approach is a semi-automatic system.

(M. Yasrebi is with the Islamic Azad University, Shiraz, Iran; phone: +98917-714-0793. M. Mohsenzadeh is with the Islamic Azad University, Science and Research Branch, Tehran, Iran. M. Abbasi-Dezfuli is with the Islamic Azad University, Science and Research Branch, Ahwaz, Iran.)

Surveying existing semantic annotation methods and platforms, we observe that all of them use a source of information, called a knowledge base, to define the concepts and semantics of the words in texts. The knowledge bases used in these tools are defective and unable to define the concepts of some words. Hence the idea of an extended knowledge base, with more knowledge and information in most domains and able to grow more and more complete. In our approach there is no need for manual information extraction, nor is it based on learning from human-created samples. The idea of information extraction lies in the concept of a knowledge base that includes a complete set of words, collections of grammars, data frames, and various lists of entities. So, first of all, we discuss the considered knowledge bases and then the operation of our approach.

This paper is structured as follows. Section II discusses the knowledge bases used in our approach. Sections III and IV define the architecture of the knowledge bases, present the model of our semantic annotation system, and define the different stages of the annotation process. Section V presents the evaluation, and conclusions are drawn in Section VI.

II. THE ROLE OF KNOWLEDGE BASES IN OUR APPROACH

In this approach, two different knowledge bases are used:
− Primary knowledge base
− Secondary knowledge base

A.
Primary Knowledge Base

The primary knowledge base is the most important and essential part of our knowledge base. It contains information about concepts/instances supplied by well-informed users. The primary knowledge base comprises a set of databases, each related to a specific domain such as medicine, chemistry, physics, or geography. Each database includes words extracted from previously processed web pages and documents, together with their concepts. As a word can convey different concepts in different domains, it may exist in two or more databases. For example, the word "water" in chemistry denotes a binary compound (H2O), but in physics it falls into the category of liquids. Therefore, we need a database per domain for these words. These databases (the parts of the primary knowledge base) become more complete as time passes; in an ideal situation, all words of a specific domain are identified and stored in the database. Another solution is to have one general database for all domains instead of one database per domain, with different fields for the different domains.

B. Secondary Knowledge Base

As its name implies, the secondary knowledge base is used to help the primary knowledge base. It includes three components:
− basic knowledge source
− data frame library
− lexicons

1. Basic Knowledge Source

The basic knowledge source (BKS) is the first part of the secondary knowledge base. Like the virtual world, BKS contains identified words for all concepts; the words extracted from web documents and source information are a subset of the words of this knowledge base.
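As a minimal sketch of the primary knowledge base described in Section II-A, the per-domain databases can be modeled as nested mappings. The domain and concept names here are illustrative, not the system's actual schema.

```python
# Primary knowledge base sketch: one word-to-concept mapping per domain.
# The same word ("water") carries a different concept in each domain.
primary_kb = {
    "chemistry": {"water": "Binary_Compound", "salt": "Ionic_Compound"},
    "physics":   {"water": "Liquid", "force": "Vector_Quantity"},
}

def lookup(word, domain):
    """Return the concept of a word in the given domain, or None if unknown."""
    return primary_kb.get(domain, {}).get(word.lower())

print(lookup("water", "chemistry"))  # Binary_Compound
print(lookup("water", "physics"))    # Liquid
```

The alternative design mentioned above, a single general database with per-domain fields, would correspond to one mapping keyed by word with one column per domain.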
Thus, BKS is a general knowledge base and is not designed for specific areas. BKS contains concepts, a set of instance data, and semantic relations; these relations describe the connections between the concepts and the words within BKS. In general, BKS has the following attributes:
− accessibility
− generality
− richness of relations between concepts

The WordNet ontology [12] fully covers the three attributes above. However, we cannot use this ontology alone to perform the extraction and induction of data into its databases and extracted semantic schemas, because it is defective for some words; we reduce these defects with the other parts, the data frame library and the lexicons. For example, the WordNet ontology cannot identify the word "alen" as a person's name, "222-2222" as a telephone number, or an e-mail address string. Since WordNet basically consists of information about concepts and their relations (e.g. hypernyms), YAGO could be considered as an additional BKS, since that ontology incorporates a great many instanceOf(instance, concept) relations with broad coverage.

2. Data Frame Library

Basically, in computer-based sciences data has poor structure, and to describe it we have to use simple classifications such as "integer", "real", or "string". On the other hand, we cannot identify concepts with these classifications; we need a classification with better structure. This classification is presented as the data frame library and constitutes the second part of our secondary knowledge base. One way to extract concepts such as dates, e-mail addresses, and phone numbers is to use regular expressions [11]. Note that these regular expressions are used both to identify concepts and to constrain the concepts in the ontology.
In this paper, we call this set of regular expressions the data frame library, e.g. regexes in the style of C# for recognizing an e-mail address, telephone number, or IP address. A data frame may also have other applications. For example, for a string such as "Address: Shiraz – Eram St. – No. 120" in a text, we need a regular expression that recognizes it as an address. In this case we use a "key/value" regular expression to recognize such concepts, as shown in Fig. 1.

Fig. 1 An example data frame for recognizing strings that contain a key/value pair

3. Lexicons

The other part of our secondary knowledge base is the lexicons, which are used to enrich the WordNet ontology as BKS. Different sources exist for assembling these lexicons, such as the World Wide Web and the "hyponym" relation in the WordNet ontology. In this way we can obtain fairly complete lists of the names of persons, animals, capitals, etc. The lexicon plays an important role in recognizing the instances of specific concepts and limiting the domain. For example, WordNet cannot identify the concept of the word "alen", but this word exists in the list of persons' names in the lexicons, which can therefore detect it as the name of a person.

III. ARCHITECTURE OF KNOWLEDGE BASES

Fig. 2 shows our knowledge-base architecture. As shown, the architecture contains all the knowledge bases described in the previous sections and their relations.

Fig. 2 The architecture of the knowledge bases

In this architecture, the primary knowledge base recognizes the concept of an extracted word in the inspected domain. If the word does not exist in the related database (for the inspected domain), the WordNet ontology as BKS recognizes the concept.
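The data frame library described above can be sketched as a table of named regular expressions. The patterns below are deliberate simplifications for illustration, not the authors' actual regexes; the concept names are likewise illustrative.

```python
import re

# Data frame library sketch: concept name -> recognizing regular expression.
DATA_FRAMES = {
    "Email_Address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "IP_Address":    re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "Phone_Number":  re.compile(r"^\d{3}-\d{4}$"),
    # Key/value frame in the spirit of Fig. 1, e.g. "Address: Shiraz - Eram St."
    "Address":       re.compile(r"^Address:\s*.+$"),
}

def match_data_frame(token):
    """Return the first concept whose pattern matches the token, or None."""
    for concept, pattern in DATA_FRAMES.items():
        if pattern.match(token):
            return concept
    return None

print(match_data_frame("222-2222"))      # Phone_Number
print(match_data_frame("192.168.1.1"))   # IP_Address
```

Production-grade patterns for e-mail or IP addresses are considerably more involved; what matters here is the mapping from a pattern match to a concept label.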
In cases where the WordNet ontology is not able to identify some concepts, the data frame library and the lexicons help it to recognize the unknown concepts. As shown in Fig. 2, all components of the secondary knowledge base are available to a competent user. A user familiar with the domain removes possible inconsistencies among concept titles in the basic knowledge source, lexicons, and data frame library, and also identifies a word's concept when no knowledge-base component can help. (If the different parts of the secondary knowledge base produce different outputs for one word, the user can eliminate the inconsistency and select the main concept of the current word.) Finally, once the concept is identified by one of the system components or by the user, it is inserted into the domain database of the primary knowledge base, which is thereby updated.

For example, if an IP-address-like expression is extracted as one word and the primary knowledge base cannot identify a concept, WordNet as BKS helps the primary knowledge base and searches for this word in its database, but WordNet is not able to identify the concept either. The data frame library, however, recognizes that the word is an IP address; it identifies the concept, which is then inserted into the primary knowledge base. From then on, if a similar expression occurs, the primary knowledge base itself identifies it as an IP address. This shows how the primary knowledge base becomes progressively more complete.

The worst case occurs when no knowledge base can recognize the word's concept. For example, if "aajbc" is the abbreviation of a company's name, in this special case a user who knows the domain helps the knowledge base and inserts its concept.

Let us review some advantages of our suggested approach:
1.
Employing a highly appropriate knowledge base of concepts related to the instances and entities in the text.
2. Allowing the user to remove possible inconsistencies among knowledge-base components.
3. Allowing the user to decide on the appropriateness of the concept selected for the relevant instance.
4. Working automatically as much as possible.

IV. THE ANNOTATION METHOD IN OUR APPROACH

Having prepared the needed knowledge base as outlined in the previous sections, we can discuss word and concept extraction and semantic annotation. First, we give a general view of the architecture of our approach and then inspect its details. Fig. 3 shows a general view of the architecture.

Fig. 3 Architecture of our approach

As Fig. 3 shows, the process contains three separate phases:
1. Determining the text's domain
2. Extracting the words and their concepts
3. Semantic annotation and tag insertion

1. Determining the Text's Domain

Since the input to the system is a text, we request the subject and domain of the text from a user who knows the domain. This can be done as a suggestion: various domains are offered to the user, who selects one of them or enters the domain manually.

2. Extracting the Words and Their Concepts

In this phase, we need to extract words which are concepts or instances of a concept, or which carry a special meaning such as an e-mail address or the name of a person. Using a pattern that delimits words, and a loop, we extract the words of the text one by one until the end of the text.
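The extraction loop of phase 2 can be sketched as follows. The splitting pattern is a stand-in, since the paper does not give the exact one; each repeated word is visited only once, anticipating the deduplication described later in this section.

```python
import re

def extract_words(text):
    """Yield candidate words of the text in order, skipping repeats.

    The pattern keeps '@', '.', and '-' inside tokens so that e-mail
    addresses, IPs, and phone numbers survive as single words; edge
    punctuation from sentence boundaries is then stripped.
    """
    seen = set()
    for token in re.findall(r"[\w@.\-]+", text):
        token = token.strip(".-")
        if token and token.lower() not in seen:
            seen.add(token.lower())
            yield token

text = "Water boils. Water freezes at 0 degrees."
print(list(extract_words(text)))  # ['Water', 'boils', 'freezes', 'at', '0', 'degrees']
```

A real system would use a more careful tokenizer, but the word-by-word loop over the text is the same.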
Having split the text into words, we send the words one by one to the knowledge base to determine their concepts. First, we send a word to the primary knowledge base, which, knowing the determined text domain, searches for the word in the database containing the words of that domain. If the word exists in the inspected database, the concept is returned; otherwise, the secondary knowledge base helps the primary knowledge base to determine the concept. The first choice for resolving the current word is WordNet as BKS. Here we have to inspect whether the word is a noun, verb, adjective, or adverb; if the word is a noun, its concepts are extracted. From WordNet we can obtain the number of senses for the current word under each part of speech. Three cases may occur:
1. No sense exists for the word as a noun.
2. Senses exist for the word as a noun and also as other types (verb, adjective, adverb).
3. Senses exist only for the noun, and none for the other types.

In the first case, we do not inspect the current word or extract concepts for it, because it is not a noun at all. In the second case, we compare the number of senses as a noun with the number of senses for each other type (verb, adjective, or adverb); if the noun senses outnumber the others, the word can evidently be treated as a noun, and otherwise we do not inspect it further. In the third case, the current word is certainly a noun and we extract its concepts. Once we have recognized that the word is a noun, we search for its concepts in WordNet.
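The three-case noun heuristic above can be sketched as follows. A real implementation would query WordNet for the per-part-of-speech sense counts; here a small hand-filled table stands in for it, with made-up counts for illustration.

```python
# Stand-in for WordNet's per-POS sense counts: word -> {pos: count}.
SENSE_COUNTS = {
    "water":   {"noun": 6,  "verb": 2,  "adjective": 0, "adverb": 0},
    "run":     {"noun": 16, "verb": 41, "adjective": 0, "adverb": 0},
    "quickly": {"noun": 0,  "verb": 0,  "adjective": 0, "adverb": 1},
}

def is_noun(word):
    """Apply the paper's three cases to decide whether to treat word as a noun."""
    counts = SENSE_COUNTS.get(word, {})
    noun = counts.get("noun", 0)
    if noun == 0:                        # case 1: no noun sense at all
        return False
    others = [counts.get(p, 0) for p in ("verb", "adjective", "adverb")]
    if all(c == 0 for c in others):      # case 3: senses only as a noun
        return True
    return noun > max(others)            # case 2: noun senses must dominate

print(is_noun("water"))    # True
print(is_noun("run"))      # False (more verb senses than noun senses)
print(is_noun("quickly"))  # False (no noun sense)
```

The reading of case 2 here, noun count greater than each other type's count, follows the text's "more than the other types"; a sum-based comparison would be an equally plausible reading.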
A list of the extracted concepts is shown to the user, who chooses the concept relevant to the word from the list; if the intended concept is not in the list, the user inserts it manually. After the user confirms, the word is inserted with its concept into the database for the text's domain, and the primary knowledge base is thereby updated and made more complete.

The above applies when WordNet can identify the concept of the word; otherwise, the data frame library or the lexicons help WordNet. If the word matches one of the existing patterns (regular expressions) in the data frame library, the concept is determined; for example, the pattern specifies that the word is an e-mail address, a phone number, or an IP address. Otherwise we search the various lists of the lexicons, and if a match is found the concept is determined; for example, a list specifies that the word is a person's name. If none of these knowledge bases can find the concept(s) of the word, a user who knows the text's domain has to insert the concept manually. After determining the concept of the current word, we move to the next word, continuing to the end of the text. To avoid processing repeated words twice, we recognize them and extract the concept for each such word only once.

3. Semantic Annotation and Tag Insertion

In this last phase, the words extracted from the text are available together with their concepts. Thus, by identifying the locations of the words in the text, we insert tags containing the concepts of the words into the text.
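The fallback sequence of phase 2 above, primary knowledge base, then WordNet as BKS, then data frames and lexicons, and finally the user, together with the primary-knowledge-base update, can be sketched as one resolution function. All components here are toy stand-ins with illustrative contents.

```python
import re

primary_kb = {"medicine": {"aspirin": "Drug"}}
bks = {"water": "Liquid"}                                # stand-in for WordNet
data_frames = {"IP_Address": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")}
lexicons = {"Person_Name": {"alen", "maria"}}

def resolve(word, domain, ask_user=lambda w: None):
    """Resolve a word's concept via the cascade; cache any hit in the primary KB."""
    if word in primary_kb.get(domain, {}):
        return primary_kb[domain][word]
    concept = bks.get(word)
    if concept is None:                                  # try the data frames
        concept = next((c for c, p in data_frames.items() if p.match(word)), None)
    if concept is None:                                  # try the lexicon lists
        concept = next((c for c, names in lexicons.items() if word in names), None)
    if concept is None:                                  # last resort: the user
        concept = ask_user(word)
    if concept is not None:                              # update the primary KB
        primary_kb.setdefault(domain, {})[word] = concept
    return concept

print(resolve("alen", "medicine"))   # Person_Name (found via the lexicon)
print(resolve("alen", "medicine"))   # Person_Name (now answered by the primary KB)
```

The second call illustrates the paper's point that the primary knowledge base grows more complete: once a concept is resolved by any component, later occurrences are answered directly.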
For example, if the word "water" appears in a text whose domain is chemistry, the tag "Binary_Compound" is added to the text as follows:

<Binary_Compound> water </Binary_Compound>

At the end of this phase, the text supplied as the input file has been annotated with semantic tags. The tagging performed here is only for presentation; an RDF format is under consideration.

V. EVALUATION

In this section we assess the performance of our system. The evaluation was carried out in two phases. First, the system output was compared with the manual output of a human annotator; manual annotation was taken to be performed under ideal, highly accurate conditions. Such evaluation, however, is time-consuming and awkward, especially for a great number of documents and web pages, so relying on software, even with a margin of error, is reasonable. In the second phase, the system output was compared with one of the existing annotation tools, Ontea. We selected this tool because it is noticeably compatible with our system: Ontea employs regular expressions and patterns as well as a knowledge base to perform the annotation process. In this evaluation, 50 HTML web pages on business job offers were delivered to both systems, and the two systems' outputs were compared. The following standard parameters were used [15], [9]:

Recall = TP / (TP + FN) (1)

Precision = TP / (TP + FP) (2)

TP is the number of items correctly assigned to a category (true positives), FP the number of items incorrectly assigned to a category (false positives), and FN the number of items incorrectly rejected from a category (false negatives). We also calculated the F-measure, the harmonic mean of recall and precision:

F-measure = 2 × Precision × Recall / (Precision + Recall) (3)

After obtaining the outputs, these parameters were calculated.
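Measures (1) to (3) can be computed from the raw counts as follows; the counts below are made-up placeholders, not the paper's actual data.

```python
# Evaluation measures (1)-(3): recall, precision, and their harmonic mean.
def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(p, r):
    return 2 * p * r / (p + r)

# Placeholder counts for illustration only.
tp, fp, fn = 90, 30, 10
r, p = recall(tp, fn), precision(tp, fp)
print(round(r, 2), round(p, 2), round(f_measure(p, r), 3))  # 0.9 0.75 0.818
```

Note that with a recall of 0.90 and a precision of 0.75, as reported below, the F-measure works out to about 0.818.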
The results are shown in Table I.

TABLE I COMPARISON OF OUR SYSTEM WITH ONTEA

As shown in Table I, the recall indicates that only 10% of the required correct annotations are not produced by our system. In other words, in 90% of cases our system managed to map the instances in the text to the appropriate concepts of the ontology, which is statistically satisfying. The recall would likely approach 100% if the structure of the pages were improved. The precision indicates that 25% of the annotations produced by the system are incorrect, i.e. an instance is mapped to a wrong concept. The high rate of this figure is due to the polysemy of words across pages: sometimes one word has two totally different concepts in two documents from the same domain. In such a case, our system assigns the concept in the second document as it did in the first, which is wrong; a user familiar with the domain, however, can resolve the problem. The F-measure shows the general status of the system. In sum, the performance results imply the system's efficiency.

The main reason for our system's better performance is its more comprehensive knowledge base. As Ontea works only with patterns, it is more useful on pages that follow explicit, pre-defined structures. For example, if the name of a company that offers a job appears as follows, Ontea is able to identify it:

Company: Logitech

Ontea is therefore an appropriate tool for such pages, but on pages that lack a clear-cut structure it fails to identify the entities in the text. The knowledge base of our system is a database including a quite complete lexicon, a comprehensive grammar with regular expressions, and lists of various entities.
It is not only a much better knowledge base for identifying entities in explicit structures; it can also identify entities in unstructured pages. Table II, extracted from [3], indicates the superiority of our system over the other systems mentioned.

TABLE II EXPERIMENTAL RESULTS FROM [3]

In general, our system performs successfully on pages containing many words and concepts. When pages include a great number of figures, however, the system loses efficiency. This problem arises from our basic knowledge source, i.e. WordNet; it could be overcome by structuring such pages using regular expressions.

VI. CONCLUSION

The Semantic Web requires the widespread availability of document annotations in order to be realized. The benefits of adding meaning to the Web include query processing using concept-searching rather than keyword-searching [1]; custom web-page generation for the visually impaired [16]; using information in different contexts, depending on the needs and viewpoint of the user [5]; and question answering [10].

In our system, concepts are extracted based on a quite comprehensive knowledge base, which includes a basic knowledge source with a quite complete set of words, sets of grammars and data frames, and various lists of entity names. The procedure runs under the control of a user familiar with the text domain, and the annotation process is therefore semi-automatic. The superiority of our system over similar ones is illustrated through a comparative study. Our future work includes enhancing the algorithm, enriching the primary and secondary knowledge bases, and increasing the system's capability of identifying numerical concepts in unstructured web pages. Further evaluation of the suggested method, considering other aspects, is also planned.
We hope to evaluate the system on a higher number of pages, numerous domains, and pages with varied contents including words, numbers, and figures.

REFERENCES
[1] T. Berners-Lee, J. Hendler, O. Lassila, "The Semantic Web," Scientific American, 2001, pp. 34-43.
[2] E. Charniak, M. Berland, "Finding parts in very large corpora," in Proc. 37th Annual Meeting of the ACL, 1999, pp. 57-64.
[3] M. Ciglan, M. Laclavik, M. Seleng, L. Hluchy, "Document indexing for automatic semantic annotation support," 2007.
[4] P. Cimiano, S. Handschuh, S. Staab, "Towards the Self-Annotating Web," in 13th International Conf. on World Wide Web, 2004, pp. 462-471.
[5] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, "SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation," in 12th International World Wide Web Conf., Budapest, Hungary, 2003, pp. 178-186.
[6] S. Handschuh, S. Staab, F. Ciravegna, "S-CREAM -- Semi-automatic CREAtion of Metadata," in SAAKM 2002 - Semantic Authoring, Annotation & Knowledge Markup - Preliminary Workshop Programme, 2002.
[7] A. Kiryakov, B. Popov, I. Terziev, D. Manov, D. Ognyanoff, "Semantic Annotation, Indexing, and Retrieval," Elsevier's Journal of Web Semantics, vol. 2, 2005.
[8] N. Kiyavitskaya, N. Zeni, J. R. Cordy, L. Mich, J. Mylopoulos, "Semi-Automatic Semantic Annotations for Web Documents," 2005.
[9] N. Kiyavitskaya, N. Zeni, J. R. Cordy, L. Mich, J. Mylopoulos, "Tool-Supported Process for Semantic Annotation: An Experimental Evaluation," 2005.
[10] P. Kogut, W. Holmes, "AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages," in Proc. Workshop on Knowledge Markup and Semantic Annotation at the First International Conference on Knowledge Capture (K-CAP 2001), Victoria, BC, 2001.
[11] M. Laclavik, M. Seleng, E. Gatial, Z. Balogh, L. Hluchy, "Ontology based Text Annotation – OnTeA," Information Modelling and Knowledge Bases XVIII,
IOS Press, Amsterdam, M. Duzi, H. Jaakkola, Y. Kiyoki, H. Kangassalo (Eds.), Frontiers in Artificial Intelligence and Applications, vol. 154, February 2007, pp. 311-315.
[12] G. Miller, "WordNet: An On-line Lexical Database," Special Issue, International Journal of Lexicography, vol. 3, 1990.
[13] B. Popov, A. Kiryakov, A. Kirilov, D. Manov, D. Ognyanoff, M. Goranov, "KIM – Semantic Annotation Platform," in 2nd International Semantic Web Conf. (ISWC2003), Florida, USA, 2003, pp. 834-849.
[14] L. Reeve, H. Han, "Survey of semantic annotation platforms," in SAC '05, ACM Press, NY, USA, 2005, pp. 1634-1638.
[15] Y. Yang, "An evaluation of statistical approaches to text categorization," Journal of Information Retrieval, vol. 1, 1999, pp. 67-88.
[16] Y. Yesilada, S. Harper, C. Goble, R. Stevens, "Ontology Based Semantic Annotation for Visually Impaired Web Travellers," in Proc. 4th International Conference on Web Engineering (ICWE 2004), Munich, Germany, 2004, pp. 445-458.