An expressed sequence tag (EST) set from Citrus sinensis L. Osbeck whole seedlings and the implications of further perennial source investigations

Michael Bausher*, Robert Shatters, Jose Chaparro, Phat Dang, Wayne Hunter, Randall Niedz

USDA-ARS-US Horticultural Research Laboratory, Horticultural Breeding, 2001 South Rock Road, Fort Pierce, FL 34945-3030, USA

Received 6 March 2003; received in revised form 29 April 2003; accepted 29 April 2003

Abstract

Compared with the large-scale single-pass cDNA sequencing entries from annual plants, the NCBI database has very little sequence information from perennial plant species. Although similar to annuals in many biochemical pathways, perennials are unique in that they possess long generation times. Without short-cycle reproduction as an escape mechanism, perennials have evolved alternative survival mechanisms to pathogen attack and environmental stresses. The study of these alternative strategies by way of functional genomics will greatly increase the understanding of the biochemical changes underlying stress responses in perennials. Herein we analyze a set of expressed sequence tags (ESTs) produced from a 180-day-old whole immature sweet orange citrus seedling cDNA library. From this library, 7680 cDNAs were single-pass sequenced from the 5' end, generating 6443 high-quality ESTs. In the analysis, 2272 ESTs (35%) were found to significantly match (E-value ≤ 10^-10) proteins with known function in the public databases using BLASTX. Additionally, 1457 ESTs (23%) significantly matched proteins with unknown function and 1619 ESTs (25%) matched proteins described as putative. The remaining 1095 ESTs (17%) failed to match with significance any protein sequence found in the public databases. ESTs matching the photosynthetic proteins chlorophyll A/B binding protein, plastocyanin and ribulose-1,5-bisphosphate carboxylase were abundant, 6.0% of the total.
Interestingly, stress-related proteins such as low molecular weight heat shock proteins, peroxidases, lipid transfer proteins and metallothionein-like proteins were also abundant, 3.6% of the total, suggesting a role for these genes in citrus seedling development.

© 2003 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Citrus; Cloning; EST; cDNA expression analysis; Metabolism; Plants; Regulation; Seedling

* Corresponding author. Tel.: +1-772-462-5918; fax: +1-772-462-5961. E-mail address: mbausher@ushrl.ars.usda.gov (M. Bausher).

Plant Science 165 (2003) 415-422. doi:10.1016/S0168-9452(03)00202-4

1. Introduction

Woody perennial plants are an important group in the phylogeny of plant biology. As a group, they have great value as a source of essential oils, secondary products, flavor components and medicines. Additionally, perennial plants have evolved systems that are resistant to extremes in environmental conditions such as temperature, moisture, light and soil conditions, as well as the ability to control insect, bacterial, fungal, and viral infections. Understanding the fundamental mechanisms of such systems will help advance research into improving woody perennial crops. Thus far, woody perennials have received little attention in the genomic literature when compared with annual systems. Nonetheless, genomic studies involving perennials as a group are certainly needed to understand more complex and specialized systems potentially unique to perennials. Expressed sequence tag (EST) sequencing is a first step in investigating the biochemical pathways that make perennial systems unique [1]. EST datasets, such as the one described in this manuscript, will provide a resource that will allow the scientific community to better understand expression profiles in specific tissues and can provide statistical clues as to the expression of individual genes when challenged by varying environmental conditions [2,3]. In addition, EST sequencing projects can enable researchers to utilize sequence information as tools in manipulating plant systems where other methods have fallen short. For example, due to the difficulty in conventional citrus breeding and the lack of large segregating populations for marker-assisted selection programs, the introduction of defense genes in small numbers by genetic engineering would seem to be a viable alternative for crop improvement [4]. Candidate genes for these studies would include the late embryogenesis abundant (LEA) proteins, heat shock proteins [5], shikimic acid pathway proteins and isoprenoid biosynthesis proteins [6,7]. Potential over-expression of these genes using inducible promoters could be a means of creating transient expression beneficial to citrus production as a whole. Transcripts of these genes and others involved in plant defense mechanisms are found in this citrus EST analysis.

Additionally, sequence information generated by EST sequencing projects can be useful in functional genomics studies by means of DNA arrays [8,9]. Microarray analysis has proven an effective tool in the overall assessment of plant stress as well as the discovery of useful biochemical pathways [6,10]. EST information also supports locating and associating phenotype expression through single nucleotide polymorphism (SNP) and marker discovery.

Here we describe the results of an EST sequencing project generated from cDNA clones of a Citrus sinensis whole seedling library. The information obtained from this project will help to better understand biochemical systems in woody perennials in general and citrus in particular.

2. Methods

2.1. Library construction

Total RNA was isolated from 20 g of 180-day-old sweet orange (C. sinensis L. Osbeck) whole seedlings grown under greenhouse conditions (maximum light intensity of 1200 µmol s^-1 m^-2 and temperatures ranging from 21 °C at night to 32 °C during the day) as described in [11]. Poly(A)+ RNA was purified using two rounds of selection on oligo(dT) magnetic beads following the manufacturer's directions (Dynal, Oslo, Norway). A directional cDNA library was constructed in the Lambda Uni-ZAP® XR vector using 5 µg of poly(A)+ RNA according to the manufacturer's instructions (Stratagene, La Jolla, CA). An amplified library was generated with a titer of 5.2 × 10^10 plaque-forming units per ml. The library was mass excised using Ex-Assist® helper phage (Stratagene), and bacterial clones containing excised pBluescript SK phagemids were recovered by random colony selection.

2.2. DNA sequencing

Plasmid clones were isolated from 1.7 ml overnight Luria-Bertani broth (ampicillin 100 mg ml^-1) cultures using a Qiagen 9600 robot and Qiaprep 96 Turbo plasmid isolation kits (Qiagen, Valencia, CA). Plasmid DNA (80 ng) was used as a template for ABI Prism® BigDye™ terminator cycle sequencing (PE Applied Biosystems, Foster City, CA) from the 5' end of the cloned cDNA using the T3 universal primer. Reactions were concentrated and washed by ethanol precipitation, and pellets were resuspended in 15 µl of formamide prior to separation on an ABI Prism 3700 sequencer (USHRL/ARS/USDA Genomics Laboratory, Fort Pierce, FL).

2.3. Sequence analysis

Raw sequences were assigned base confidence scores using Trace Tuner® 2.0 (Paracel, Pasadena, CA), while vector and end trimming were performed with Sequencher® 4.17b (Gene Codes, Ann Arbor, MI). In the case of anomalous sequence, vector trimming was edited manually. Contaminating rRNA, mitochondrial and chloroplast sequences were identified by BLASTN similarity searches and removed, along with sequences less than 150 nucleotides in length.
The remaining ESTs were subjected to similarity searches against the GenBank non-redundant (nr) protein database using the BLASTX algorithm with default parameters. Sequences that returned no significant match were then compared with the nr nucleotide and EST databases using BLASTN with default parameters. Search results were formatted using an in-house parsing program and imported into a Microsoft Excel® spreadsheet for further analysis and classification. Sequence matches with E-values ≤ 10^-10 were considered significant and were used to categorize the ESTs based on their putative function using a modified MIPS MATDB classification scheme adopted for Arabidopsis. To identify those sequences that represented redundant transcripts, ESTs were assembled into contigs using Sequencher® 4.17b (Gene Codes) with a 95% minimum match and a 50-base minimum overlap as assembly parameters. Contig consensus sequences and singleton sequences are collectively referred to as assembled sequences in this analysis.

3. Results

3.1. Sequencing and assembly

The cDNA library generated from the whole seedlings of C. sinensis yielded an average insert size of 745 bp based on 96 randomly selected clones. A total of 7680 cDNAs were subjected to 5' end single-pass sequencing, which resulted in an average EST read length of 511 nucleotides following vector and low-quality sequence trimming. The final number of ESTs after removal of rRNA, mitochondrial and chloroplast sequences and filtering for a 150-nucleotide minimum length was 6443. A total of 1094 contigs and 2798 singletons were formed after assembly of the 6443 ESTs from this dataset. The 1094 contigs encompassed 3645 ESTs, resulting in a redundancy of 56%. The largest contig contained 31 ESTs, with 40 contigs containing 10 ESTs or more. The remaining 2798 ESTs failed to assemble based on our assembly parameters and are, therefore, considered unique ESTs in this dataset.
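The assembly figures quoted above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the counts come from the text, but the definition of redundancy as the fraction of ESTs that fall into a contig is an assumption inferred from the reported numbers, and the variable names are illustrative.

```python
# Counts reported in the text for the C. sinensis seedling library.
total_ests = 6443
n_contigs = 1094
n_singletons = 2798

# ESTs that assembled into contigs are those that are not singletons.
ests_in_contigs = total_ests - n_singletons

# Assembled sequences are the sum of contigs and singletons.
assembled_sequences = n_contigs + n_singletons

# Assumption: redundancy = share of all ESTs that belong to a contig.
redundancy = ests_in_contigs / total_ests

print(ests_in_contigs, assembled_sequences, f"{redundancy:.3f}")
```

Under that assumed definition the ratio is roughly 0.566, in line with the 56% redundancy reported for this dataset.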
The combined set of contigs and singletons resulted in 3892 assembled sequences representing putative transcripts found in the C. sinensis whole seedling cDNA library. The library summary for this analysis is found in Table 1.

3.2. Similarities found

Examination of the initial BLASTX matches showed that the ESTs could be grouped into four distinct categories based on the match significance (E-value) and descriptor of their top hit (Fig. 1A). Group 1 consisted of 2272 ESTs that showed significant matches (E-value ≤ 10^-10) to protein sequences with known function in the public databases. Group 2 consisted of 1619 ESTs in which the top match was significant (E-value ≤ 10^-10) but the protein was described as putative (this analysis considered the similarity search match designations "-like", "similar to", "has homology to" and "-related" as proteins with putative function). Group 3 comprised 1457 ESTs in which the top hit was significant (E-value ≤ 10^-10) but the protein had unknown function. Group 4 consisted of 1095 ESTs that had no significant match (E-value > 10^-10) to any protein sequence in the public databases.

To address the lack of significant matches among Group 4 ESTs, a distribution of query sequence length versus significant hits was constructed (Fig. 2). It illustrates that as query sequence length decreases, there is an increased probability of returning no significant match to protein databases. At query sequence lengths > 551 bases, the expected percentage of sequences with no significant match is consistent at approximately 10%, whereas the length classes from 151-250 bp through 551-650 bp show a linear relationship, reaching a maximum of 60% in the shortest (151-250 bp) class.

In addition to protein searches using BLASTX, Group 4 ESTs were also searched against the nr nucleotide and EST databases using BLASTN.
The results showed that 492 (45%) have been found in other organisms and 603 (55%) are completely novel.

Table 1
C. sinensis whole seedling EST sequencing project summary

Library and EST summary
  Mean insert size                   745 base pairs
  Library titer                      5.2 × 10^10 pfu ml^-1
  Number of cDNAs sequenced          7680
  Mean EST length                    511 bases
  Number of high quality ESTs        6443
Contig assembly results
  Number of ESTs assembled           6443
  Number of contigs                  1094
  Number of singletons               2798
  Number of assembled sequences      3892
Contig sizes
  2-4 ESTs                           914
  5-7 ESTs                           109
  8-10 ESTs                          42
  11-13 ESTs                         11
  ≥14 ESTs                           18

Mean EST length following vector and end clipping. EST assembly parameters were 95% minimum match with 50 minimum base overlap. Assembled sequences are the sum of contigs and singletons.

Fig. 1. (A) Graphical representation of ESTs from the C. sinensis whole seedling library based on E-values and descriptors of top matches using BLASTX. Group 1 ESTs have significant similarity to known proteins; Group 2 ESTs have significant similarity to proteins with the descriptor putative, similar to, -like, contains homology to and has similarity to; Group 3 ESTs have significant similarity to proteins with unknown function; Group 4 ESTs lack any significant match to public protein databases. (B) Group 1 EST distribution using functional categories based on a modified MIPS classification system.

Group 1 ESTs were assigned putative gene function based on the initial BLASTX matches using a modified MIPS MATDB Arabidopsis classification scheme (Fig. 1B). For purposes of the analysis, the MIPS sub-categories photosynthesis, defense/stress and secondary metabolism were used as major categories, and the category protein synthesis/processing was a combination of the following MIPS categories: protein synthesis, transcription, and protein destination.
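As an illustration only, the four-group scheme of Fig. 1A can be expressed as a small Python function. This is a sketch and not the in-house parsing program mentioned in the Methods: the function name classify_est, the keyword list used to recognize unknown-function hits, and the choice to test for unknown function before testing for putative descriptors are all assumptions introduced here. Only the E-value cutoff and the putative descriptors are taken from the text.

```python
import re
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # significance threshold stated in the text

# Descriptors that the text and the Fig. 1 caption treat as "putative".
PUTATIVE_PATTERN = re.compile(
    r"putative|similar to|similarity to|homology to|-like\b|-related\b",
    re.IGNORECASE,
)

# Assumed keywords for "unknown function"; the paper does not list them.
UNKNOWN_PATTERN = re.compile(r"unknown|hypothetical", re.IGNORECASE)


def classify_est(description, e_value):
    """Return the group (1-4) for one EST from its top BLASTX hit.

    description and e_value are None when no hit was returned.
    """
    if description is None or e_value is None or e_value > E_VALUE_CUTOFF:
        return 4  # no significant match
    if UNKNOWN_PATTERN.search(description):
        return 3  # significant match, unknown function
    if PUTATIVE_PATTERN.search(description):
        return 2  # significant match, putative function
    return 1      # significant match, known function


# Hypothetical top hits, used only to show how group counts are tallied.
example_hits = [
    ("chlorophyll a/b binding protein", 1e-120),
    ("putative peroxidase", 1e-50),
    ("unknown protein", 1e-30),
    ("peroxidase", 1e-5),
    (None, None),
]
group_counts = Counter(classify_est(d, e) for d, e in example_hits)
print(sorted(group_counts.items()))
```

Applied to real search results, tallies of this kind would correspond to the group sizes plotted in Fig. 1A, although the exact counts depend on the descriptor rules actually used.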
In the analysis, the two largest categories of Group 1 ESTs were photosynthesis and protein synthesis/processing, with 27 and 23%, respectively. Metabolism matched 14% of the ESTs, with 9% matching proteins related to cellular biogenesis. The category defense/stress comprised 8% of the ESTs, followed by energy and secondary metabolism, each with 7%. Lastly, cell growth/division accounted for 4% and cellular communication for 1% of the ESTs found in Group 1.

Highly abundant transcripts found in this dataset are listed in Table 2. The largest representation was the chlorophyll A/B binding protein super-family, showing similarity to 309 ESTs. Metallothionein-like proteins, 60S ribosomal protein and ribulose-1,5-bisphosphate carboxylase (Rubisco) were also highly represented. Lipid transfer protein, ATP synthase components, plastocyanin, peroxidase and polyubiquitin were also abundant in this analysis. Together these nine protein types accounted for 654 ESTs or 10.1% of the total. Accession numbers for the most significant match as well as E-values are displayed in Table 2.

The largest contigs based on consensus sequence length found in the analysis are listed in Table 3. BLASTX matches to consensus sequences revealed that several contigs contained complete open reading frames, demonstrating the utility of EST sequencing in conjunction with EST assembly to obtain complete coding region sequences of transcripts. Contigs matching orcinol O-methyltransferase from almond, Prunus dulcis (accession CAA11131), and aminomethyltransferase-like precursor protein from Arabidopsis thaliana (accession At1g11860) both exhibited complete open reading frame sequence. Several other contigs, matching H+-transporting two-sector ATPase from Citrus unshiu, aldehyde dehydrogenase (NAD+)-like protein from A. thaliana and peroxidase from Gossypium hirsutum, were complete to within 15 amino acids of the translation start site. Consensus sequences for these two contigs are being submitted to GenBank.

Fig. 2. Distribution of non-significant E-values (> e-10) derived from BLASTX similarity searches as a function of query sequence length. The x-axis represents the length of the query sequence in nucleotides. The y-axis represents the number of non-significant E-values as a percent of the total sequences returned following BLASTX searches within a base length group.

Table 2
Abundant ESTs found in the C. sinensis whole seedling library

Gene function                                Count (a)   E-value (b)   Accession number (c)
Chlorophyll A/B binding protein              309         e-120         AAC15992
Metallothionein-like protein                 83          e-36          BAA31562
60S ribosomal protein                        53          e-093         P29766
Ribulose-1,5-bisphosphate carboxylase        50          e-119         P08926
Lipid transfer protein                       40          e-60          AAM21292
H+-transporting ATP synthase, chain B        33          e-107         T43789
Plastocyanin                                 30          e-55          P17340
Peroxidase                                   30          e-89          BAA82306
Polyubiquitin                                26          e-114         CAA66667
Late embryogenesis abundant protein (LEA)    23          e-49          Q39644
Translationally controlled tumor protein     23          e-78          Q9ZSW9

(a) Number of ESTs assigned to the same gene function.
(b) Represents the lowest BLASTX score returned for the EST group.
(c) Accession number for the lowest BLASTX score for the EST group.

4. Discussion

Analysis of 7680 cDNAs from the C. sinensis whole seedling library gave rise to 6443 high quality ESTs with an average sequence length of 511 bases. To maintain the overall quality of the sequence data reported in this work, careful attention was taken to eliminate contaminating ribosomal RNA, chloroplast and mitochondrial genes and vector sequence by multiple searches against the non-redundant nucleotide, mitochondrial, and vector databases at NCBI.
A total of 3892 assembled sequences (sum of contigs and singletons) were identified in the analysis, 1094 being contigs of two or more ESTs and the remainder singletons. This resulted in a redundancy of 56%. However, this value is only an approximation because sequencing errors, chimeric cDNAs, truncation, and non-overlapping regions of the same gene affect EST assembly and consequently EST redundancy.

Similarity searches of our ESTs showed that 35% (Group 1) shared significant similarity with characterized protein sequences from other organisms in the nr protein database at NCBI. In addition, 25% (Group 2) were significantly similar to proteins with putative function and 23% (Group 3) were similar to proteins with unknown function (Fig. 1A). Combined, 83% of the ESTs from this dataset significantly matched protein sequences found in the public databases. This percentage is larger than that found in other plant EST projects [12-15], most likely because of the exponential growth of submitted sequences from large genome projects such as Arabidopsis that were not present at the time those studies were conducted.

Of importance to Citrus genomics are the ESTs, representing 17% (Group 4) of the total, that showed no significant match to any protein sequence found in the public databases. When searched against the non-redundant nucleotide and EST databases at NCBI using BLASTN, 55% (603) of Group 4 ESTs again resulted in no significant match, suggesting that these transcripts could potentially represent important genes specific to Citrus.
Further studies of these genes could reveal primary and secondary defense systems as well as unique biosynthetic pathways for the production of secondary products not yet discovered in other systems.

When reporting the potential number of unique transcripts in an EST dataset, it must be kept in mind that because similarity searches against protein databases rely on the query sequence containing protein-coding regions, shorter cDNAs may not give an accurate reflection of their uniqueness. We investigated this by determining how query sequence length affects BLASTX search results. Not surprisingly, we found that a reduction in query sequence length increases the probability of not finding a significant match in protein databases (Fig. 2). One explanation is that in plant species, the average 3' untranslated region (UTR) is thought to be 250 bases in length [16]. Hence, shorter sequences, as in the 151-250 group, may consist primarily of 3' UTR with little to no protein-coding sequence. This is especially true for ESTs in which polyadenylation signal motifs or actual poly(A) sequence can be discerned. Consequently, similarity searches with short ESTs, will

Table 3
Largest contigs based on sequence length from the C. sinensis whole seedling library

Putative gene function                                  Consensus length (nt)   Accession number (a)   Source organism (b)
Aldehyde dehydrogenase (NAD+)-like protein              1636                    NP_1900383             A. thaliana
Unknown protein                                         1566                    NP_1922181             A. thaliana
Pyrophosphate-dependent phosphofructokinase             1477                    AAC67587               Citrus paradisi
Aminomethyltransferase-like precursor protein           1440                    NP_172650              A. thaliana
Peroxidase                                              1424                    T10790                 G. hirsutum
Rac GTPase activating protein I                         1345                    AAC62624               Lotus japonicus
H+-transporting two-sector ATPase                       1329                    T43789                 Citrus unshiu
3-deoxy-D-arabino-heptulosonate 7-phosphate synthase    1283                    CAA75092               Morinda citrifolia
Permease                                                1272                    NP_176211              A. thaliana
Orcinol O-methyltransferase                             1263                    AAM23005               Rosa hybrid cultivar
Putative protein                                        1260                    NP_568605              A. thaliana
Hypothetical protein                                    1258                    ZP_00129858            Desulfovibrio desulfuricans
Carbonic anhydrase                                      1250                    AAM22683               G. hirsutum

(a) Represents the accession number from the most significant BLASTX match.
(b) Represents the source organism from the most significant BLASTX match.
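The length-dependence analysis summarized in Fig. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the 100-nucleotide class width follows the 151-250 and 551-650 classes named in the text, while the intermediate class boundaries, the function names, and the toy records are assumptions made for the example.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10  # significance threshold stated in the text
BIN_START = 151         # first length class named in the text starts here
BIN_WIDTH = 100         # assumed width, matching the 151-250 class


def length_bin(length):
    """Return (low, high) bounds of the length class containing length."""
    index = (length - BIN_START) // BIN_WIDTH
    low = BIN_START + index * BIN_WIDTH
    return (low, low + BIN_WIDTH - 1)


def percent_no_hit_by_length(records):
    """Percent of queries without a significant hit, per length class.

    records is an iterable of (query_length, best_e_value) pairs, where
    best_e_value is None when BLASTX returned no hit at all.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for length, e_value in records:
        key = length_bin(length)
        totals[key] += 1
        if e_value is None or e_value > E_VALUE_CUTOFF:
            misses[key] += 1
    return {key: 100.0 * misses[key] / totals[key] for key in sorted(totals)}


# Hypothetical records, not data from this study.
toy_records = [(200, None), (200, 1e-50), (600, 1e-20), (600, 1e-3)]
toy_result = percent_no_hit_by_length(toy_records)
print(toy_result)
```

Plotting the resulting percentages against the length classes would give a curve of the kind described for Fig. 2, with the caveat that the class boundaries here are assumed.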