ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Description
1. Extracting patterns of database and software usage from the bioinformatics literature Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert…
Transcript
  • 1. Extracting patterns of database and software usage from the bioinformatics literature. Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens. The University of Manchester, UK. http://www.cs.man.ac.uk/~duckg/ http://bionerds.sourceforge.net/networks/
  • 2. Introduction • Methods are fundamental to science – Judgement – Replication – Extension • Methods in bioinformatics: – In silico: Data and tools – Workflows • Objective representation • Sharing and reuse
  • 3. Bioinformatics • Resource-focused domain: “Resourceome” – Our research suggests: • Around 200,000 unique resources in the literature • Over 4 million mentions • … and still growing! • Resource/method search and selection… – Best-practice – Common-practice • What are the main patterns in bioinformatics resources, and associated methods?
  • 4. Approach • Use bioinformatics literature (to answer this question) • Extract database and software mentions • Combine resources to form pairs • Combine pairs to form patterns – Common-practice – Method? [Figure: example resources BLAST, Modeller, PROCHECK, ClustalW, PHYLIP]
  • 5. Document Collection • PubMed Central open-access full-text articles • Bioinformatics[MeSH] • 22,376 articles • 67 journals • 3 journals accounted for > 50% of total documents [Figure: Number of Documents by Year]
  • 6. bioNerDS • bioNerDS – Bioinformatics named entity recogniser for databases and software – Full-text; Mention level – Rule-based – F-score 63-91% – Previously compared resource usage in: • Genome Biology • BMC Bioinformatics • Networks filter: – 702,937 total mentions – 167,697 document level mentions – 31,053 unique names – 93% single mention • Duck et al. (2013) BMC Bioinformatics http://bionerds.sourceforge.net/
  • 7. bioNerDS • Genome Biology: “Biological” focus – GenBank – Ensembl – GEO – GO • BMC Bioinformatics: “Resource” focus – R – PDB – PubMed • http://bionerds.sourceforge.net/
  • 8. Mention Filtering • Filter out resources mentioned in fewer than 2 documents – Removed 25% of mentions – Removes less likely names • Generic resources – R – Bioconductor • Categorise as database or software – Removed some ‘unknown’ resources
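The minimum-document filter on this slide can be sketched as follows. This is only an illustration: the `(doc_id, name)` mention format, the function name and the sample data are assumptions, not bioNerDS output.

```python
from collections import defaultdict

def filter_by_document_count(mentions, min_docs=2):
    """Keep mentions whose resource name occurs in at least `min_docs` documents.

    `mentions` is a list of (doc_id, resource_name) tuples (assumed format).
    """
    docs_per_name = defaultdict(set)
    for doc_id, name in mentions:
        docs_per_name[name].add(doc_id)
    return [(d, n) for d, n in mentions if len(docs_per_name[n]) >= min_docs]

# Hypothetical mentions: "MyLocalTool" appears in one document only.
mentions = [
    ("doc1", "BLAST"), ("doc1", "BLAST"), ("doc2", "BLAST"),
    ("doc1", "MyLocalTool"),
    ("doc3", "ClustalW"), ("doc2", "ClustalW"),
]
filtered = filter_by_document_count(mentions)
```

Counting distinct documents rather than raw mentions means a name repeated many times in a single article is still dropped.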
  • 9. Methods Sections • Removed resources not in the methods section – Method or non-method • Regular expression based title detection – Tested on 100 articles – Precision: 97%; Recall: 79% • Resulting in: – 69,466 database mentions (1,711 unique) – 65,451 software mentions (3,289 unique)
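A rough sketch of regular-expression-based section title detection is given below. The slides do not show the actual expression, so the pattern and the list of accepted titles here are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: an optional section number followed by a typical
# methods-style heading. The authors' real expression is not in the slides.
METHODS_TITLE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(?:materials?\s+and\s+methods?|methods?\s+and\s+materials?|methods?|"
    r"methodology|experimental\s+procedures?|implementation)\s*$",
    re.IGNORECASE,
)

def is_methods_title(title):
    """Return True if a section title looks like a methods section heading."""
    return METHODS_TITLE.match(title) is not None

titles = ["Background", "2. Methods", "Materials and Methods", "Results", "Methods and Results"]
methods_titles = [t for t in titles if is_methods_title(t)]
```

Anchoring the pattern to the whole title favours precision over recall, which is the same trade-off the reported figures (97% precision, 79% recall) suggest.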
  • 10. Extracting Pairs • Co-occurrence within text • Two sets of pairs: – Software only pairs – Database and software pairs (any combination of) • This provided us with: – 22,880 software pairs (13,965 unique) – 54,562 database/software pairs (29,066 unique) • Removed pairs found only within a single document – 53% of the software pairs – 46% of the database/software pairs
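One way to picture the pair extraction is sketched below. The slides only say "co-occurrence within text", so treating consecutive distinct mentions as an ordered pair is an assumption, as are the function names and the sample documents.

```python
from collections import Counter, defaultdict

def extract_pairs(doc_mentions):
    """Count ordered pairs of consecutive, distinct resource mentions.

    `doc_mentions` maps doc_id -> resource names in order of appearance
    (assumed format). Also records which documents contain each unordered pair.
    """
    counts = Counter()
    docs = defaultdict(set)
    for doc_id, names in doc_mentions.items():
        for first, second in zip(names, names[1:]):
            if first != second:
                counts[(first, second)] += 1
                docs[frozenset((first, second))].add(doc_id)
    return counts, docs

def drop_single_document_pairs(counts, docs):
    """Remove pairs whose two resources co-occur in only one document."""
    return Counter({p: c for p, c in counts.items() if len(docs[frozenset(p)]) > 1})

# Hypothetical documents.
doc_mentions = {
    "d1": ["BLAST", "ClustalW", "PHYLIP"],
    "d2": ["BLAST", "ClustalW"],
    "d3": ["ClustalW", "BLAST", "BLAST"],
}
counts, docs = extract_pairs(doc_mentions)
kept = drop_single_document_pairs(counts, docs)
```

Here the ClustalW/PHYLIP pair is dropped because it occurs in `d1` only, while both orders of BLAST/ClustalW are kept for the later direction test.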
  • 11. Common Pairs • With sufficient data, the most common order of a pairing is the correct one… • Binomial test – each order is equally likely • Two confidence thresholds: – 95% • 2,518 software pairs (145 unique) • 7,001 database/software pairs (297 unique) – 99% • 1,450 software pairs (55 unique) • 3,383 database/software pairs (95 unique)
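The direction test can be illustrated with an exact binomial test under the null hypothesis that both orders are equally likely (p = 0.5). Whether the authors used a one-sided or two-sided test is not stated; the two-sided form, the function names and the example counts below are assumptions.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5."""
    tail = min(k, n - k)
    lower = sum(comb(n, i) for i in range(tail + 1))
    # The distribution is symmetric at p = 0.5, so double one tail and cap at 1.
    return min(1.0, 2 * lower / 2 ** n)

def preferred_direction(a_to_b, b_to_a, alpha=0.05):
    """Return 'a->b' or 'b->a' if one order is significantly more common, else None."""
    n = a_to_b + b_to_a
    if n == 0:
        return None
    if binomial_two_sided_p(a_to_b, n) >= alpha:
        return None
    return "a->b" if a_to_b > b_to_a else "b->a"

# Hypothetical counts: 9 mentions in one order, 1 in the other.
at_95 = preferred_direction(9, 1, alpha=0.05)
at_99 = preferred_direction(9, 1, alpha=0.01)
```

With these counts the p-value is 22/1024 (about 0.021), so the pair passes the 95% threshold but not the 99% one, which mirrors how the stricter threshold on the slide leaves fewer pairs.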
  • 12. Most Common Pairs
      Software only pairs
        Directed Pair            Count     %
        BLAST → ClustalW           205  14.1
        BLAST → PSI-BLAST          103   7.1
        Phred → Phrap               89   6.1
        ClustalW → MEGA             77   5.3
        Cluster → Tree View         75   5.2
        Phrap → Consed              51   3.5
        Modeller → PROCHECK         44   3.0
        BLAST → ClustalX            43   3.0
        ClustalW → PHYLIP           41   2.8
        BLAST → MUSCLE              40   2.8
      Software and database pairs
        Directed Pair            Count     %
        GO → KEGG                  350  10.3
        BLAST → GO                 195   5.8
        BLAST → ClustalW           150   4.4
        GEO → GO                   129   3.8
        Phred → Phrap               89   2.6
        BLAST → PSI-BLAST           87   2.6
        PDB → Modeller              85   2.5
        Swiss-Prot → TrEMBL         82   2.4
        Ensembl → BioMart           82   2.4
        ClustalW → MEGA             77   2.3
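The percentage column is not explained on the slide. The figures are numerically consistent with each count being divided by the total number of pairs at the 99% threshold from slide 11 (1,450 software pairs; 3,383 database/software pairs); that reading is an inference, checked here with a small calculation.

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)

# Totals taken from slide 11 (99% threshold); using them as the
# denominator is an assumption, not something the slide states.
software_total_99 = 1450
database_software_total_99 = 3383

blast_clustalw = share(205, software_total_99)       # software only table
go_kegg = share(350, database_software_total_99)     # software and database table
```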
  • 19. Resource Patterns • Databases – Data sources – GO is an exception (major sink; data annotation) – Numerous ‘same’ links (enumeration in text?) • Software – Data sinks – Represents the primary in silico pipeline(s) – Again, sequence alignment is central
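The source/sink reading of a directed pair network can be sketched with simple degree counts. The classification rule and the small edge list are assumptions for illustration; the edges are a hand-picked subset of pairs from slide 12 and do not reproduce the networks shown in the talk.

```python
from collections import Counter

def classify_nodes(edges):
    """Label each node as 'source', 'sink' or 'intermediate' by in/out degree."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    roles = {}
    for node in set(out_deg) | set(in_deg):
        if in_deg[node] == 0:
            roles[node] = "source"        # only outgoing links
        elif out_deg[node] == 0:
            roles[node] = "sink"          # only incoming links
        else:
            roles[node] = "intermediate"  # both directions
    return roles

# Toy subset of directed pairs (illustrative only).
edges = [("BLAST", "GO"), ("GEO", "GO"), ("PDB", "Modeller"), ("Modeller", "PROCHECK")]
roles = classify_nodes(edges)
```

In this toy subset GO only receives links and so is labelled a sink, while PDB and GEO act as sources.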
  • 20. Patterns through Time [Figure panels: 2004 to 2006; 2007 to 2009]
  • 21. Patterns through Time [Figure panel: 2010 to 2012]
  • 22. Phylogenetics Patterns • Case-study… • Eales et al. (2008) BMC Bioinformatics, 9, 359 – Mapped phylogenetics methods into 4 steps: • Sequence Alignment • Tree Inference • Statistical Testing • Tree Visualisation – Using the same corpus selection, we built a network… • PubMed search for “phylogen*” in titles or abstracts
  • 23. Phylogenetics Patterns [Figure]
  • 24. Phylogenetics Patterns • Our automated extraction can recreate these steps – Given some ambiguous resources • Encouraging… – Viable in silico pattern extraction – “Common practice” • Next step: Apply this to other (sub-)domains
  • 25. Conclusion • Can extract patterns of resource usage – Can we describe the method through these? • High level overview of common-practice – With lower thresholds, can access resources specific (but “common”) to different subdomains – Not best-practice… • Workflows? – Requires increased granularity – Could help inform their creation
  • 26. Thank-you • Acknowledgements – Co-authors – Manchester IT Services • Computational facilities – Funding: – Travel: