Distinguishing the species of biomedical named entities for term identification

BMC Bioinformatics (BioMed Central), Open Access

Research

Xinglong Wang*1,3 and Michael Matthews2

Address: 1 National Centre for Text Mining, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK; 2 School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, UK; 3 The work described in this paper was carried out at the School of Informatics, University of Edinburgh, UK

Email: Xinglong Wang* - xinglong.wang@manchester.ed.ac.uk; Michael Matthews - m.matthews@ed.ac.uk

* Corresponding author

Abstract

Background: Term identification is the task of grounding ambiguous mentions of biomedical named entities in text to unique database identifiers. Previous work on term identification has focused on studying species-specific documents. However, full-length articles often describe entities across a number of species, in which case resolving the ambiguity of model organisms in entities is critical to achieving accurate term identification.

Results: We developed and compared a number of rule-based and machine-learning based approaches to resolving species ambiguity in mentions of biomedical named entities, and demonstrated that a hybrid method achieved the best overall accuracy at 71.7%, as tested on the gold-standard ITI-TXM corpora. By utilising the species information predicted by the hybrid tagger, our rule-based term identification system was improved significantly by up to 11.6%.

Conclusion: This paper shows that, in the context of identifying terms involving multiple model organisms, integration of an accurate species disambiguation system can significantly improve the performance of term identification systems.
From the Natural Language Processing in Biomedicine (BioNLP) ACL Workshop 2008, Columbus, OH, USA, 19 June 2008.

Published: 19 November 2008

BMC Bioinformatics 2008, 9(Suppl 11):S6 doi:10.1186/1471-2105-9-S11-S6

Proceedings of the BioNLP 08 ACL Workshop: Themes in biomedical language processing. Edited by Dina Demner-Fushman, K Bretonnel Cohen, Sophia Ananiadou, John Pestian, Jun'ichi Tsujii and Bonnie Webber. Research.

This article is available from: http://www.biomedcentral.com/1471-2105/9/S11/S6

© 2008 Wang and Matthews; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The exponential growth of the amount of scientific literature in the fields of biomedicine and genomics has made it increasingly difficult for scientists to keep up with the state of the art. The TXM project [1], a three-year project which aims to produce software tools to aid curation of biomedical papers, targets this problem and exploits natural language processing (NLP) technology in an attempt to automatically extract enriched protein-protein interactions (EPPI) and tissue expressions (TE) from biomedical text. A critical task in TXM is term identification (TI), the task of grounding mentions of biomedical named entities to identifiers in referent databases. TI can be seen as an intermediate task that builds on the previous component in an information extraction (IE) pipeline, i.e., named entity recognition (NER), and provides crucial information as input to the more complex module of relation extraction (RE).
The structure of the IE pipeline resembles a typical curation process by human biologists. For example, when curating protein-protein interactions (PPIs), a curator would first mark up the protein mentions in text, then identify the mentions by finding their unique identifiers in standard protein databases such as RefSeq [2], and finally curate pairs of IDs as PPIs.

TI is a matching and disambiguation process [3], and a primary source of ambiguity lies in the model organisms of the terms. In curation tasks, one often needs to deal with collections of articles that involve entities of a large variety of species. For example, our collection of articles from PubMed and PubMed Central involves over 100 model organisms. Also, it is often the case that more than one species appears in the same document, especially when the document is a full-length article. In our dataset, 74% of the articles concern more than one organism.

In many standard databases, such as RefSeq and SwissProt, homologous proteins in different species, which often have nearly identical synonym lists, are assigned distinct identifiers. This makes biomedical terms even more polysemous, and hence species disambiguation becomes crucial to TI. For example, querying RefSeq with the protein mention plk1 returned 98 hits. Adding a species to the query, e.g. mouse, reduces the number of results to two.

The work most relevant to ours is the Gene Normalisation (GN) tasks [4,5] in the BioCreAtIvE I & II workshops [6,7]. The data provided in the GN tasks, however, were species-specific: the lexicons and datasets were concerned with single model organisms, and thus species disambiguation was not required. A few participating systems nevertheless integrated a filter to rule out entities with erroneous species [8,9], which was reported to be helpful.
Another difference between our task and the BioCreAtIvE GN tasks is that we carry out TI at the entity level, whereas GN is carried out at the document level.

It is worth mentioning that the protein-protein interaction task (IPS) in BioCreAtIvE II took species ambiguity into account. The IPS task resembles the workflow of manual curation of PPIs in articles involving multiple species, and accomplishing it requires a full pipeline of IE systems, including named entity recognition, term identification and relation extraction. The best result for IPS [10] was fairly low at 28.85% F1, which reflects the difficulty of the task. Some IPS participants reported (e.g., [11]) that resolving species ambiguity was one of the biggest challenges. Our analysis of the IPS training data revealed that the interacting proteins in this corpus belong to over 60 species, and only 56.27% of them are human.

As noted in previous work [10-14], determining the correct species for protein mentions is a very important step towards TI. However, as far as we know, there has been little work on species disambiguation or on the extent to which resolving species ambiguity at the entity level can help TI.

Results and discussion

Species disambiguation

The species tagger was developed on the ITI TXM corpora [15], which were produced as part of the TXM project [1]. We created two corpora in slightly different domains, EPPI and TE. The EPPI corpus consists of 217 full-text papers selected from PubMed and PubMed Central; domain experts annotated all documents for both protein entities and PPIs, as well as extra (enriched) information associated with the PPIs and normalisations of the proteins to publicly available ontologies. The TE corpus consists of 230 full-text papers, in which entities such as proteins, tissues, genes and mRNAcDNAs were identified, and a new tissue expression relation was marked up.

We used these corpora to develop a species tagging system.
The biomedical entities in the data were manually assigned standard database identifiers: genes were assigned EntrezGene IDs, and proteins and mRNAcDNAs were assigned RefSeq IDs. Hence it was straightforward to obtain their species IDs through the mappings provided by EntrezGene and RefSeq. In more detail, proteins, protein complexes, genes and mRNAcDNAs in both the EPPI and TE datasets were assigned NCBI Taxonomy IDs (TaxIDs) [16] to denote their species. The EPPI and TE datasets have different distributions of species. For example, the entities in the EPPI training data belong to 118 species, with human being the most frequent at 51.98%, whereas those in the TE training set span 67 species, with mouse being the most frequent at 44.67%.

To calculate the inter-annotator agreement, about 40% of the documents were doubly annotated by different annotators. The averaged F1 scores of species annotation on the doubly annotated EPPI and TE datasets are 86.45% and 95.11%, respectively, indicating that human annotators have high agreement when assigning species to biomedical entities.

To assess how much species ambiguity accounts for the overall ambiguity in biomedical entities, we estimated the averaged ambiguity rates for the protein entities in the TXM datasets, without and with the species information. Suppose there are $n$ unique protein mentions in a dataset. First, we look up the RefSeq database by exact match with every unique protein mention $m_i$, where $i \in \{0, \ldots, n-1\}$, and for each $m_i$ we retrieve two lists of identifiers: $L_i$ and $L'_i$, where $L_i$ consists of all identifiers and $L'_i$ only contains the identifiers whose model organism matches the manually tagged species of the protein mention. The ambiguity rates without and with species are computed by $\frac{\sum_{i=0}^{n-1} |L_i|}{n}$ and $\frac{\sum_{i=0}^{n-1} |L'_i|}{n}$, respectively.
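The two ambiguity rates described above (the average number of identifiers per unique mention, without and with a species filter) could be computed along the following lines. This is only a minimal sketch: the toy lexicon, its identifiers and the function name `ambiguity_rates` are hypothetical, each unique mention is assumed to carry a single gold TaxID, and an in-memory dictionary stands in for the exact-match RefSeq lookup.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: mention -> list of (identifier, NCBI TaxID) pairs.
# The identifiers are placeholders, not real RefSeq IDs.
LEXICON: Dict[str, List[Tuple[str, int]]] = {
    "plk1": [("ID_A", 9606), ("ID_B", 10090), ("ID_C", 10116)],
    "p53": [("ID_D", 9606), ("ID_E", 10090)],
}


def ambiguity_rates(
    gold_species: Dict[str, int],
    lexicon: Dict[str, List[Tuple[str, int]]],
) -> Tuple[float, float]:
    """Return (rate without species, rate with species).

    gold_species maps each unique mention m_i to its manually tagged TaxID.
    """
    n = len(gold_species)
    if n == 0:
        raise ValueError("need at least one unique mention")
    total_all = 0      # sum of |L_i|
    total_species = 0  # sum of |L'_i|
    for mention, taxid in gold_species.items():
        hits = lexicon.get(mention, [])  # exact-match lookup
        total_all += len(hits)
        total_species += sum(1 for _, hit_taxid in hits if hit_taxid == taxid)
    return total_all / n, total_species / n


rates = ambiguity_rates({"plk1": 10090, "p53": 9606}, LEXICON)
print(rates)
```

In this toy case the five identifiers spread over two mentions give a rate of 2.5 without species, and filtering by the gold species leaves exactly one identifier per mention.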
Table 1 shows the ambiguity rates on the EPPI and TE datasets.

Using the ITI TXM corpora, we first devised a number of rule-based species disambiguation systems. It is intuitive that a species word that occurs near an entity (e.g., "mouse p53") is a strong indicator of its species. To assess this intuition, we developed a set of rules using heuristics and the species words detected by a species word tagger (to be described later).

• PreWd: If the word preceding an entity is a species word, assign the species indicated by that word to the entity.
• PreWd Sent: If a species word occurs to the left of an entity and in the same sentence, assign the species indicated by that word to the entity.
• Prefix: If an entity has a species-indicating prefix, e.g., mSos-1, then tag the species to that entity.
• Spread: Spread the species of an entity e to all entities in the same document that have the same surface form as e. This rule must be used in conjunction with the other rules.
• Majority Vote: Count the species words in a document and assign as a weight to each species the proportion of all species words in the document that refer to the species. For example, if there are $N$ species words in a document and $N_{human}$ are associated with human, the human species weight is calculated as $N_{human}/N$. Tag all entities in the document with the species with the highest weight, defaulting to human in the case of a tie. This rule was used by default in the rule-based TI system, described later in this paper.

Table 2 shows the results of species tagging when the above rules were applied. As we can see, the precision of the systems that rely solely on the previous species words or prefixes is very good but the recall is low. The system that looks at the previous species word in the same sentence does better as measured by F1.
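The PreWd, PreWd Sent, Prefix and Majority Vote rules listed above can be illustrated with a small sketch. It is not the authors' implementation: the species-word and prefix lexicons are hypothetical, the Prefix pattern (a lowercase species letter followed by an uppercase letter) is an assumption, PreWd Sent is assumed to pick the nearest species word to the left, and a document without species words is treated like a tie.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical lexicons for illustration only (word or prefix -> NCBI TaxID).
SPECIES_WORDS = {"human": 9606, "mouse": 10090, "rat": 10116}
SPECIES_PREFIXES = {"h": 9606, "m": 10090, "r": 10116}
HUMAN = 9606


def pre_wd(tokens: List[str], entity_index: int) -> Optional[int]:
    """PreWd: species of the word immediately preceding the entity, if any."""
    if entity_index == 0:
        return None
    return SPECIES_WORDS.get(tokens[entity_index - 1].lower())


def pre_wd_sent(tokens: List[str], entity_index: int) -> Optional[int]:
    """PreWd Sent: a species word to the left of the entity in the sentence.

    Assumption: when several species words precede the entity, take the nearest.
    """
    for token in reversed(tokens[:entity_index]):
        taxid = SPECIES_WORDS.get(token.lower())
        if taxid is not None:
            return taxid
    return None


def prefix(entity: str) -> Optional[int]:
    """Prefix: lowercase species letter followed by an uppercase letter (assumed)."""
    if len(entity) >= 2 and entity[0].islower() and entity[1].isupper():
        return SPECIES_PREFIXES.get(entity[0])
    return None


def majority_vote(doc_tokens: List[str]) -> int:
    """Majority Vote: species with the highest weight N_species / N; human on ties."""
    counts = Counter(
        SPECIES_WORDS[t.lower()] for t in doc_tokens if t.lower() in SPECIES_WORDS
    )
    total = sum(counts.values())
    if total == 0:
        return HUMAN  # assumption: no species words is treated like a tie
    weights = {taxid: count / total for taxid, count in counts.items()}
    best = max(weights.values())
    winners = [taxid for taxid, weight in weights.items() if weight == best]
    return winners[0] if len(winners) == 1 else HUMAN


sentence = "The amounts of human and mouse CD200R were measured".split()
entity_index = sentence.index("CD200R")
print(pre_wd(sentence, entity_index), prefix("mSos-1"), majority_vote(sentence))
```

On the example sentence, PreWd and PreWd Sent both pick mouse for 'CD200R', while Majority Vote sees one human and one mouse word and therefore falls back to human.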
In addition, spreading the species improves both systems but the overall results are still not satisfactory.

It is slightly counter-intuitive that using a rule such as 'PreWd' did not achieve perfect precision. Closer inspection revealed that most of the false positives were due to a few problematic guidelines in the annotation process. For example,

• "The amounts of human and mouse CD200R ...", where 'CD200R' was tagged as mouse (10090) by the system but the gold-standard answer was human (9606). This was due to the fact that the annotation tool was not able to assign multiple correct species to a single entity.
• "... wheat eIFiso4G ...", where 'eIFiso4G' was tagged as wheat (4565) but the annotator thought it was Triticum (4564). In this case, TaxID 4565 is a species under genus 4564, and arguably is also a correct answer. Other similar cases include Xenopus vs. Xenopus tropicalis, and Rattus vs. Rattus norvegicus, etc. This is the main cause of the false positives, as our system always predicts species instead of genus or TaxIDs of any other ranks, which the annotators occasionally employed.

Furthermore, we split the EPPI and TE datasets into training and development test (devtest) sets and developed a machine-learning (ML) based species tagger. Using the training splits, we trained a maximum entropy classifier [17] on a number of features such as contextual words and nearby species words, which will be detailed later. The results of the ML species tagger are shown in Table 3. We measure the performance in accuracy instead of F1 because the ML based tagger assigns a species tag to every entity occurrence, and therefore precision is equal to recall. We tested four models on the devtest portions of the EPPI and TE corpora:

• BL: a baseline system, which tags the devtest instances using the most frequent species occurring in the corresponding training dataset.
  For example, human is the most frequent species in the EPPI training data, and therefore all entities in the EPPI devtest dataset were tagged with human.
• EPPI Model: obtained by training the maxent classifier on the EPPI training data.
• TE Model: obtained by training the maxent classifier on the TE training data.
• Combined Model: obtained by training the maxent classifier on a joint dataset consisting of both the EPPI and TE training corpora.

Table 1: Ambiguity in protein names. Ambiguity in protein entities, with and without species information, in EPPI and TE datasets. Ambiguity is the ID count divided by the protein count.

| Dataset | Protein Cnt | ID Cnt | Ambiguity |
|---|---|---|---|
| EPPI | 6,955 | 184,633 | 26.55 |
| EPPI species | 6,955 | 17,357 | 2.50 |
| TE | 8,539 | 103,016 | 12.06 |
| TE species | 8,539 | 12,705 | 1.49 |

Finally, we devised a hybrid species tagging system. As we have shown, the rules 'PreWd' and 'Prefix' achieved very good precision but low recall, which suggests that when these rules were applicable, it is highly likely that they would get the correct species. Based on this observation, we combined the ML approach and the rule-based approach in such a way that the rules 'PreWd' and 'Prefix' were applied on top of ML to override the predictions made by ML. The hybrid systems were tested on the same datasets and the results are shown in the right three columns of Table 3.

We performed significance tests on the results in Table 3. First, a Friedman test was used to determine whether the 7 sets of results were significantly different, and then pairwise Wilcoxon Signed Rank tests were employed to tell whether any system performed significantly better than others. On both datasets, the 6 machine-learning models significantly outperformed the baseline (p < 0.01).
On the EPPI devtest dataset, the EPPI models (with or without rules) and the Combined Models outperformed the TE models (p < 0.05), while on the TE dataset, the TE models and the Combined Models outperformed the EPPI models (p < 0.05). Applying the post-filtering rules did not significantly improve the ML models, although it appears that adding the rules consistently increased the accuracy.

Table 2: Results (%) of the rule-based species tagger

| Rule | EPPI devtest P | EPPI devtest R | EPPI devtest F1 | TE devtest P | TE devtest R | TE devtest F1 |
|---|---|---|---|---|---|---|
| PreWd | 81.88 | 1.87 | 3.65 | 91.49 | 1.63 | 3.21 |
| PreWd + Spread | 63.85 | 14.17 | 23.19 | 77.84 | 17.97 | 29.20 |
| PreWd Sent | 60.79 | 5.16 | 9.52 | 56.16 | 7.76 | 13.64 |
| PreWd Sent + Spread | 39.74 | 50.54 | 44.49 | 31.71 | 46.68 | 37.76 |
| Prefix | 98.98 | 3.07 | 5.96 | 77.93 | 2.97 | 5.72 |
| PreWd + Prefix | 91.95 | 4.95 | 9.40 | 82.27 | 4.62 | 8.75 |
| PreWd + Prefix + Spread | 68.46 | 17.49 | 27.87 | 77.77 | 21.26 | 33.39 |
| Majority Vote | 44.10 | 44.10 | 44.10 | 49.87 | 49.87 | 49.87 |

Term identification with species disambiguation

Experiments on the ITI TXM corpora

To identify whether species disambiguation can improve the performance of TI, we ran the TI system on the EPPI and TE datasets in the ITI TXM corpora. We tested the TI systems with or without help from a number of species tagging systems, including:

• Baseline: Run TI without species tags. Note that the TI system already integrated a basic species tagging system that uses the Majority Vote rule. Thus this is a fairly high 'baseline'.
• Gold Species: Run TI with manually tagged species. This is the upper-bound performance.
• Rule: Run TI with species predicted by the rule-based species tagger using rules "PreWd" and "Prefix".
• ML(human/mouse): Run TI with the species that occurs most frequently in the training datasets (i.e., human for EPPI and mouse for TE).
• ML(EPPI): Run TI with species predicted by the ML tagger trained on the EPPI training dataset.
• ML(EPPI)+Rule: Run TI with species predicted by the hybrid system using both ML(EPPI) and the rules.

Table 3: Results (%) of the machine-learning and hybrid species taggers.
Accuracy (%) of the machine-learning based species tagger and the hybrid species tagger as tested on the EPPI and TE devtest datasets. An 'Overall' score is the micro-average of a system's accuracy on both datasets.

| Dataset | BL | EPPI Model | TE Model | Combined Model | EPPI Model+Rules | TE Model+Rules | Combined Model+Rules |
|---|---|---|---|---|---|---|---|
| EPPI devtest | 60.56 | 73.03 | 58.67 | 72.28 | 74.24 | 59.67 | 73.77 |
| TE devtest | 30.22 | 67.15 | 69.82 | 67.20 | 67.53 | 70.14 | 67.47 |
| Overall | 48.88 | 70.77 | 62.96 | 70.33 | 71.66 | 63.70 | 71.34 |

• ML(TE): Run TI with species predicted by the ML tagger trained on the TE training dataset.
• ML(TE)+Rule: Run TI with species predicted by the hybrid system using both ML(TE) and the rules.
• ML(EPPI+TE): Run TI with species predicted by the ML tagger trained on both EPPI and TE training data.
• ML(EPPI+TE)+Rule: Run TI with species predicted by the hybrid system using both ML(EPPI+TE) and the rules.

We score the systems using top n precision, where n ∈ {1, 5, 10, 15, 20}. The argument for this evaluation scheme is that if a TI system is not good enough at predicting a single identifier correctly, a 'bag' of IDs with the correct answer included would also be helpful. The 'Avg. Rank' field denotes the averaged position at which the correct answer lies; the lower the value, the better the TI system performs. For example, a TI system with an 'Avg. Rank' of 1 would be ideal, as it would always return the correct ID at the top of the list. Note that in the TE data, not only protein entities, but also genes, mRNAcDNAs, and GOMOPs were tagged, where a GOMOP denotes an entity being either a gene, or an mRNAcDNA, or a protein.

As shown in Tables 4 and 5, on both datasets, using the gold standard species much improved the accuracy of TI (e.g., 19.2% on EPPI data).
Also, automatically predicted species tags proved to be helpful. On the EPPI data, ML(EPPI)+Rule outperformed the other systems. Note that the species distribution in the devtest dataset is strongly biased towards human, which explains why the ML(human) system performed nearly as well. However, defaulting to human is not guaranteed to succeed because one would not know the prior species distribution in a collection of unseen documents. Indeed, on the TE data, the system ML(mouse), which uses mouse as default, yielded poor results.

Experiments on BioCreAtIvE data

To assess the portability of the species tagging approaches, an "artificial" dataset was created by joining the species-specific datasets from the BioCreAtIvE 1 & 2 GN tasks to form a corpus consisting of four species. In detail, four datasets were taken, three from BioCreAtIvE 1 task 1B (i.e., fly, mouse and yeast) and one from BioCreAtIvE 2 task GN (i.e., human). Assuming genes in each dataset are species-specific, we can train/test ML models for species disambiguation and apply them to help TI. This task is more difficult than the original BioCreAtIvE GN tasks due to the additional ambiguity caused by multiple model organisms. Note that the above assumption is not strictly true because each dataset may contain genes of other species, and it would be hard to assess how true it is as abstracts in the BioCreAtIvE GN datasets are not normalised at the entity level.

We first carried out experiments on species disambiguation. In addition to the TXM (i.e., the system that uses the ML(EPPI+TE)+Rule model) and the Majority Vote taggers, we trained the species tagger on a dataset comprising the devtest sets from the BioCreAtIvE I & II GN tasks. In more detail, we first pre-processed the dataset and marked up gene entities with an NER system [11,18], which was trained on BioCreAtIvE II GM training and test datasets.
The entities were tagged with the species indicated by the source dataset from which they were drawn, which was used as the 'gold' species. Using the same algorithm and feature set as described previously, a BC model was trained.

As shown in Table 6, except on human, the TXM model yielded very disappointing results, whereas the BC model did well overall. This was because the TXM model was trained on a dataset where fly and yeast entities occur rarely, with only 2% and 5% of the training instances belonging to these species, respectively, which again revealed the influence of the bias introduced in the training material on the ML models.

Table 4: Results (%) of TI on the EPPI dataset. All figures, except 'Avg. Rank', are percentages. This evaluation was carried out on protein entities only.

| Method | Prec@1 | Prec@5 | Prec@10 | Prec@15 | Prec@20 | Avg. Rank |
|---|---|---|---|---|---|---|
| Baseline | 54.31 | 73.45 | 76.44 | 77.90 | 78.51 | 5.82 |
| Gold Species | 73.52 | 79.36 | 80.75 | 80.75 | 80.99 | 1.62 |
| Rule | 54.99 | 73.72 | 76.45 | 77.91 | 78.52 | 5.79 |
| ML(human) | 65.66 | 76.36 | 78.82 | 79.78 | 80.03 | 2.58 |
| ML(EPPI) | 65.24 | 76.82 | 79.01 | 79.93 | 80.29 | 2.39 |
| ML(EPPI)+Rule | 65.88 | 77.09 | 79.04 | 79.94 | 80.30 | 2.36 |
| ML(TE) | 55.87 | 75.14 | 78.69 | 79.85 | 80.30 | 2.86 |
| ML(TE)+Rule | 56.54 | 75.47 | 78.70 | 79.86 | 80.31 | 2.83 |
| ML(EPPI+TE) | 64.55 | 76.48 | 78.53 | 79.83 | 80.38 | 2.49 |
| ML(EPPI+TE)+Rule | 65.03 | 76.62 | 78.55 | 79.84 | 80.39 | 2.46 |
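The top n precision and 'Avg. Rank' measures reported for the TI systems can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation script used in the paper: the function names and toy identifiers are hypothetical, and 'Avg. Rank' is assumed to be averaged only over entities whose correct identifier appears somewhere in the returned list, since the text does not say how missing answers are handled.

```python
from typing import List, Optional, Sequence


def precision_at_n(ranked: Sequence[List[str]], gold: Sequence[str], n: int) -> float:
    """Fraction of entities whose gold ID is among the top n returned IDs."""
    if not gold:
        raise ValueError("no entities to score")
    hits = sum(1 for ids, answer in zip(ranked, gold) if answer in ids[:n])
    return hits / len(gold)


def average_rank(ranked: Sequence[List[str]], gold: Sequence[str]) -> Optional[float]:
    """Mean 1-based position of the gold ID, over entities where it was returned.

    Assumption: entities whose gold ID is absent from the list are skipped.
    """
    ranks = [ids.index(answer) + 1 for ids, answer in zip(ranked, gold) if answer in ids]
    return sum(ranks) / len(ranks) if ranks else None


# Toy example: three entities, each with a ranked list of candidate IDs.
ranked_ids = [["a", "b", "c"], ["x", "y"], ["p", "q"]]
gold_ids = ["a", "y", "z"]

print(precision_at_n(ranked_ids, gold_ids, 1))
print(precision_at_n(ranked_ids, gold_ids, 5))
print(average_rank(ranked_ids, gold_ids))
```

In the toy example only the first entity has its gold ID at the top, two of three have it within the top five, and the two found answers sit at positions 1 and 2.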