A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Verspoor et al. BMC Bioinformatics 2012, 13:207

RESEARCH ARTICLE - Open Access

Karin Verspoor1*, Kevin Bretonnel Cohen1,2*, Arrick Lanfranchi2,3, Colin Warner4, Helen L Johnson1, Christophe Roeder1, Jinho D Choi3, Christopher Funk1, Yuriy Malenkiy1, Miriam Eckert2, Nianwen Xue4, William A Baumgartner Jr1, Michael Bada1, Martha Palmer2 and Lawrence E Hunter1

Abstract

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.
Background

Text mining of the biomedical literature has gained increasing attention in recent years, as biologists are increasingly faced with a body of literature that is too large and grows too rapidly to be reviewed by single researchers [1]. Text mining has been used both to perform targeted information extraction from the literature, e.g. identifying and normalizing protein-protein interactions [2], and to assist in the analysis of high-throughput assays, e.g. to analyze relationships among genes implicated in a disease process [3]. Systems performing text mining of biomedical text generally incorporate processing tools to analyze the linguistic structure of that text. At a syntactic level, systems typically include modules that divide the texts into individual word or punctuation tokens, delimit sentences, and assign part-of-speech tags to tokens. It is becoming increasingly common to perform syntactic parsing of the texts as well, with either a full constituent parse or a dependency parse representation. At a more conceptual level, named entity recognition, or identification of mentions of specific types of entities such as proteins or genes, is a widely used component of systems that aim to perform entity-oriented text mining.

*Correspondence: kevin.cohen@gmail.com. 1 Computational Bioscience Program, U. Colorado School of Medicine, 12801 E 17th Ave, MS 8303, Aurora, CO 80045, USA. 2 Department of Linguistics, University of Colorado Boulder, 290 Hellems, Boulder, CO 80309, USA. Full list of author information is available at the end of the article.

© 2012 Verspoor et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Historically, the majority of research in biomedical natural language processing has focused on the abstracts of journal articles. However, recent years have seen numerous attempts to move into processing the bodies of journal articles. Cohen et al. [4] compared abstracts and article bodies and found that they differed in a number of respects with implications for natural language processing. They noted that these differences sometimes demonstrably affected tool performance. For example, gene mention systems trained on abstracts suffered severe performance degradations when applied to full text.

It has been previously noted that there was inadequate linguistically annotated biological text to make domain-specific retraining of natural language processing tools feasible [5]. With the release of CRAFT, we now have a large resource of appropriately annotated full-text articles in the biomedical domain to enable both evaluation and retraining.

In this paper, we introduce the linguistic annotation of a significant new resource, the Colorado Richly Annotated Full Text (CRAFT) corpus. CRAFT consists of the full contents of 97 Open Access journal articles, comprising nearly 800k tokens. CRAFT has been manually annotated with a number of elements of linguistic structure, corresponding to the functions listed above. It has also been annotated with semantic content, in the form of biological concepts from several semantic classes characterized by biological ontologies. In prior work, we established that Open Access journal articles do not differ in terms of linguistic structure or semantic content from traditional journal articles [6], and we therefore take this corpus as representative of the biomedical literature more generally. Along with this paper, we are publicly releasing 67 (70%) of the articles, constituting 70.8% of the tokens in the full corpus.
The corpus is publicly available online. The availability of the CRAFT corpus makes it possible for the first time to evaluate a number of hypotheses with exciting implications for the near-term development of biomedical text mining. In this work, we explore several uses of the CRAFT corpus for evaluating the performance of natural language processing tools. We specifically consider (a) the generalizability of training algorithms and existing models to the new corpus, and (b) the impact of the availability of full-text training data for new model development. A priori, genre differences have plagued natural language processing for years, and full texts are clearly a different genre from that on which most extant systems have been developed: abstracts of journal articles [4]. Those who have worked with full text have noted various ways in which full texts differ from abstracts [7-11], mainly focusing on distributional differences of certain types of keywords and assertions. Nonetheless, a few authors have developed systems to process full text. Friedman and Rzhetsky developed the GENIES system, which processes full-text journal articles [10]; Rzhetsky's GeneWays system does as well [12]; and the recent BioCreative III evaluation required systems to process full text [13].

In this work we first introduce the syntactic annotation of the CRAFT corpus. The annotation of genes and ontological concepts is described in more detail in Bada et al. (2012) [14]. Next, with this sufficiently large collection of annotated biomedical full-text documents, we report the head-to-head performance of a number of language processing tools selected for their difficulty, for their relevance to any language processing task, and for their amenability to evaluation with well-annotated gold standard data.
Specifically, we examined the performance of tools for:

• Sentence boundary detection
• Tokenization
• Part-of-speech tagging
• Syntactic parsing
• Named entity recognition, specifically of gene names

Sentence boundary detection was included because it is an essential first task for any practical text mining application. Tokenization was included both because it is an essential prerequisite for any practical language processing application and because it is notoriously difficult for biomedical text (see e.g. [1,15]). Part-of-speech tagging and syntactic parsing were included because the use of syntactic analyses in biomedical text mining is a burgeoning area of interest in the field at present [16,17]. Finally, gene mention recognition was included because prior work has shown drastic differences in gene mention performance on full text across a range of gene mention systems and models [4]. We perform a broad survey of existing systems and models, and also retrain systems on the full-text data to explore the impact of the annotated training data.

Previous investigations of syntactic parser performance on biomedical text [5,18] have focused on parser performance on biomedical abstracts rather than full-text publications. In particular, [18] evaluates accuracy on only 79 manually reviewed sentences, while [19,20] explore similarly small corpora of 300 and 100 sentences, respectively. The CRAFT corpus, in contrast, contains over 20,000 manually analyzed parsed sentences in the portion we are publicly releasing at this time: the full contents of 67 journal articles, containing over 500k tokens (see the Methods section for details on the partitioning of the data).

Prior biomedical corpus annotation work

There has been significant prior work on corpus annotation in the biomedical domain. Until the very recent past, this has focused on the biological, rather than the medical, domain.
The biological corpora are most relevant to the work discussed here, so we focus on them. The biomedical corpora site currently lists 26 biomedical corpora and document collections. Of this large selection, we review here only some of the most influential or recent ones.

The flagship biomedical corpus has long been the GENIA corpus [21,22]. Studies of biomedical corpus usage and design in [23,24] reviewed several biomedical corpora extant as of 2005 with respect to their design features and their usage rates outside of the labs that built them. Usage rate outside of the lab that built a corpus was taken as an indicator of the general usefulness of that corpus. These studies concluded that the most influential corpus to date was the GENIA corpus. This was attributed to two factors: the fact that it was the only corpus containing linguistic and structural annotation, and the fact that the corpus was distributed in standard, easy-to-process formats with which the natural language processing community was familiar. In contrast, the other corpora lacked linguistic and structural annotation, and were distributed in one-off, non-standard formats.

The GENETAG corpus [25] has been very useful for the gene mention recognition problem. It achieved wide currency due to its use in two BioCreative shared tasks. The BioInfer corpus [26] is a collection of 1100 sentences from abstracts of journal articles, annotated with entities according to a self-defined ontology and showing relationships between them by means of a syntactic dependency analysis. The BioScope corpus [27] is a set of 20,000 sentences that have been annotated for uncertainty, negation, and their scope. Most recently, the various data sets associated with the Association for Computational Linguistics BioNLP workshop [17,28] have been widely used for their annotations of multiple biological event types, as well as uncertainty and negation.
Results and Discussion

Annotation of document structure, sentence boundaries, tokens, and syntax

Syntactic annotation: introduction

Although CRAFT is not the first corpus of syntactically annotated biomedical text, it provides the first constituent annotation of full-text biomedical journal articles. The Penn Treebank's BioIE project provided much of the basic skeleton for the workflow of this type of annotation. However, we did have to make several new policies or expand existing PTB policies for syntactic annotation in the biomedical domain (discussed below).

The markup process of the CRAFT corpus consisted of phases of automatic parsing and manual annotation and correction of all 97 articles in the corpus. Automatic segmentation and tokenization were performed, then part-of-speech tags were automatically applied to every token in the data according to each token's function in a given context (for details see below). We employed the Penn Treebank's full part-of-speech tagset (which consists of 47 tags: 35 POS tags and 12 punctuation, symbol, or currency tags) without any alterations (see Additional file 1 for the full tagset). This output was then hand corrected by human annotators.

After hand correction, the data was automatically parsed into syntactic phrase structure trees with the Penn Treebank's phrasal tagset. Syntactic nodes indicate the type of phrase of which a token or a group of tokens is a part. They form constituents that are related to one another in a tree structure, where the root of the tree encompasses the largest construction and the branches supply the relationship between the main components of the tree (subject, verb/predicate, verb arguments and modifiers); each of these main components may contain internal phrase structure.
CRAFT added 4 nodes representing article structure, CIT, TITLE, HEADING and CAPTION (discussed below), to the original tagset. The automatically processed trees were then hand corrected. Automatic parsing did not provide function tags or empty categories, which were also adapted from the Penn Treebank syntactic tagset, so those were added by hand during bracketing correction. Function tags are appended to node labels to provide additional information about the internal structure of a constituent or its role within the parent node. CRAFT added one new function tag, -FRM (discussed below). Empty categories provide a placeholder for material that has been moved from its expected position in the tree, arguments that have been dropped, such as an empty subject, or material that has been elided.

The data was finalized with two iterations of quality-control verification to ensure that all the data was consistently annotated and that all policy changes adopted at different stages of the project were properly implemented across all data. A rough estimate of the total time required to syntactically annotate the full corpus is approximately 80 hours a week for 2.5 years (including 6 months for training).

Given the input text, "Little is known about genetic factors affecting intraocular pressure (IOP) in mice and other mammals" (PMCID 11532192), the final segmented, tokenized, part-of-speech tagged, syntactically parsed and annotated output is as follows, with each phrase in parentheses and part-of-speech tags to the left of their respective tokens.

(S (NP-SBJ-1 (JJ Little))
   (VP (VBZ is)
       (VP (VBN known)
           (NP-1 (-NONE- *))
           (PP (IN about)
               (NP (NP (JJ genetic) (NNS factors))
                   (VP (VBG affecting)
                       (NP (NP (NP (JJ intraocular) (NN pressure))
                               (NP (-LRB- -LRB-) (NN IOP) (-RRB- -RRB-)))
                           (PP-LOC (IN in)
                               (NP (NP (NNS mice))
                                   (CC and)
                                   (NP (JJ other) (NNS mammals))))))))))
   (. .))

We describe below the major implementations and policy adaptations that yield the above tree.
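The bracketed Penn Treebank notation shown above is straightforward to process programmatically. As a minimal illustration (a sketch only, not a tool used in the CRAFT project), a recursive reader that turns such a bracketed parse into nested Python lists might look like:

```python
import re

def parse_ptb(s):
    """Read a Penn Treebank-style bracketed parse such as
    '(S (NP-SBJ-1 (JJ Little)) ...)' into nested lists:
    ['S', ['NP-SBJ-1', ['JJ', 'Little']], ...]."""
    # Split into parentheses and whitespace-free tokens (labels and words).
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]  # node label, e.g. 'S', 'NP-SBJ-1', 'JJ'
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf token (a word)
                pos += 1
        pos += 1  # consume the closing bracket
        return [label] + children

    return node()

print(parse_ptb("(S (NP-SBJ-1 (JJ Little)) (VP (VBZ is) (VP (VBN known))))"))
```

Because literal parentheses in the text are escaped as -LRB-/-RRB- in the treebank format, leaf tokens such as these and the empty-category marker * pass through the tokenizer unchanged.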
Selection and amendment of annotation guidelines

For the POS annotation, we chose to follow the 3rd revision of the POS-tagging guidelines of the Penn Treebank project [29]. For the treebanking, we have followed the guidelines for Treebank II [30-32] and Treebank 2a [33], along with those for the BioIE project [34], which is an addendum to the Treebank II guidelines based on annotation of biomedical abstracts. Employing these guidelines of the Penn Treebank project enables us to contribute our collection of richly annotated biomedical texts to a larger collection of treebanked data that represents a multitude of genres and that already includes biomedical journal abstracts. Finally, we modified or extended these guidelines to account for biomedical phenomena not adequately addressed in them (see Additional file 2 for the CRAFT addenda to the PTB2 and BioIE guidelines). A set of these changes was made at the beginning of the project, resulting from examination of the corpus articles, and further changes were made throughout the course of the project as issues arose; descriptions and examples of these changes can be seen below.

Training of annotators and creation of markup

The lead syntactic annotator (CW), who had five years of syntactic annotation experience, first trained the senior syntactic annotators (AL, AH), the former of whom trained a third senior syntactic annotator (TO).
These lead and senior annotators were responsible for policy changes, documentation, quality control, and training of additional annotators, who were required to have some knowledge of syntax and semantics (with at least one year of completed Master's-level linguistics coursework) and some previous experience in syntactic annotation. These additional annotators were first trained to perform POS tagging for approximately one month with Penn's newswire training files and then on a chapter of an introductory biology book [35], followed by treebanking training for several weeks to one month on short training files obtained through the Penn Treebank project. Treebanking training continued on the aforementioned book chapter and finally on the first article of the corpus. Altogether, training for syntactic annotation lasted approximately six months. All training was performed on flat text (i.e., text that had not been automatically annotated).

For the syntactic annotation of the corpus, sentence segmentation, tokenization, and POS markup was first automatically generated using the GENIA parser. Each article's automatically generated markup was manually corrected by one annotator in the lex mode of Emacs. This was followed by the automatically generated treebanking of these articles (with the corrected segmentation, tokenization, and POS markup) using the parser of the OpenNLP project. Each article's automatically generated treebanking markup was then manually corrected by one annotator using TreeEditor. Since they are not generated by this parser, the annotators used TreeEditor to add empty categories, which are syntactic placeholders in the tree construction that indicate arguments that have been moved from their expected positions in the trees, and function tags, which specify additional information about phrases not represented in the treebanking markup, e.g., the location of an action. Additionally,
sentence-segmentation errors not previously found were corrected manually outside of TreeEditor, as it does not have the capability of merging sentences. The corrected output of this annotator was checked by the syntactic lead annotator.

The output of the syntactic lead then underwent the final phase of syntactic annotation, referred to as the quality-control phase. This phase consisted of automatic validation of POS tags (e.g., checking that a phrase annotated as a prepositional phrase actually begins with a word annotated as a preposition) and of sentences (e.g., checking that each S node had a subject and a verb at the appropriate level of attachment) using CorpusSearch, followed by manual correction of indicated errors. This step allowed us to confirm tree uniformity, to verify that errors had not been introduced during the manual correction of previous passes, and to ensure that changes in annotation guidelines or policy made during the project were consistently reflected in the final output. For example, during the course of annotation, the treatment of prepositional phrases beginning with "due to" changed from being annotated as recursive prepositions, i.e., (PP due (PP to)), to being annotated as flat multiword prepositions, i.e., (PP due to). A validation script was written to detect recursively annotated occurrences of such prepositional phrases, an example of which is provided below.

These results explain why defective PDGF signal transduction results in a reduction of the v/p cell lineage and ultimately in perinatal lethality due to vessel instability (Hellström et al. 2001).

68 PP-PRP: 68 PP-PRP, 69 IN, 70 due, 71 PP
(62 NP (63 NP (64 JJ perinatal) (66 NN lethality))
       (68 PP-PRP (69 IN due)
                  (71 PP (72 IN to)
                         (74 NP (75 NN vessel) (77 NN instability)))))

This error message indicates that there is a recursive PP error and provides the full sentence, the reference number(s) of the element(s) involved in the error, and the current parse of the tree.
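A consistency check of this kind can be approximated with a simple pattern match over the bracketed text. The sketch below is a hypothetical Python illustration of the idea only: the project itself used CorpusSearch queries, and real CRAFT trees also carry the numeric reference indices shown above, which this sketch ignores.

```python
import re

# Flags the deprecated recursive analysis (PP ... (IN due) (PP (IN to) ...)),
# which project policy later replaced with a flat multiword preposition.
RECURSIVE_DUE_TO = re.compile(r"\(PP[^()]*\(IN due\)\s*\(PP\s*\(IN to\)")

def has_recursive_due_to(bracketed):
    """Return True when a tree string contains a recursive 'due to' PP."""
    return bool(RECURSIVE_DUE_TO.search(bracketed))

old_style = "(PP-PRP (IN due) (PP (IN to) (NP (NN vessel) (NN instability))))"
flat_style = "(PP-PRP (IN due) (IN to) (NP (NN vessel) (NN instability)))"
print(has_recursive_due_to(old_style))   # True
print(has_recursive_due_to(flat_style))  # False
```

The pattern deliberately matches extended labels such as PP-PRP, since function tags are appended to the base node label.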
Given this output, the annotator manually corrected this error in the file by deleting the extra PP node for "to".

Guidelines

Full-text journal articles present issues that can be uniquely distinguished from the style of the abstracts that the Penn BioIE project annotated. We found that Penn's guidelines for biomedical abstract annotation did not cover the increased technical complexity of a full-length article, such as the parenthetical information, definitions, and figure and table captions found throughout a full-text article, necessitating regular policy review and addendum construction. Major changes to Penn's guidelines include the addition of the node labels TITLE, HEADING and CAPTION to replace the -HLN function tag (see below), and CIT for citations. We have added one new function tag, -FRM, for the top-level constituent (S) of formulas, where a mathematical symbol (<, >, =) is treated as a verb. The use of the PRN node label has been expanded from the TB2a policy [33], which only allows for a clausal PRN (reference, appositive-like adjectives). Because of the large number of nominals and other parentheticals in the CRAFT data, we have allowed any node label inside of PRN. The use of the -TTL function tag has been slightly modified from ETTB as well. Each of these node and function label additions and expansions has been made in order to provide labeling that accurately represents the more complex structure of biomedical articles.

We have also changed how shared adjuncts are bracketed, which are now adjoined to coordinated VP or S; added more structure to single-token coordinated NMLs; and refined Penn's POS and tokenization policy to account for additional symbols, such as ° (degree) (as in 35°C). Another significant change we have made is the elimination of PP-CLR. PTB2 allows for PP-CLR on verbal arguments. However, we felt that this policy was not clearly defined and was difficult to consistently apply.
We have retained the -CLR in S-CLR for resultatives and secondary predicates.

The last change we implemented was the complete elimination of the empty category *P* (placeholder for distributed material) introduced in the Penn BioIE guideline addendum. With the increased complexity of full-length articles, we felt that these policies were difficult to apply consistently and greatly increased the complexity of the annotation and resulting trees. We maintain that existing policy on NML and NP coordination preserves much of the same information represented by *P*.

In PTB2, the -HLN function tag indicates a headline or a dateline, as found in newswire texts. However, the section headings in journal articles have a slightly different function and convey different information than a news headline. Since the treebanked data are journal articles, we are using more informative labels for nodes that would have been tagged with -HLN (see example below) based on newswire bracketing guidelines (Guidelines [31], Section "Ongoing and future work").

(FRAG-HLN (NP-SBJ-1 Clinton)
          (S-PRD (NP-SBJ *PRO*-1)